This guide walks you through the challenges and strategies of monitoring Large Language Models (LLMs). We’ll discuss potential model pitfalls, provide key metrics for performance assessment, and offer a practical checklist to ensure model accountability and efficacy. With this knowledge, you can optimize your LLM’s performance and get the most value from these intricate models.
Setting the Stage
You’ve been tasked with deploying a Large Language Model (LLM) for a new chatbot feature you’re rolling out and you want to make sure your LLM-powered chatbot is transparent and trustworthy. On top of that, you want to run sentiment analysis to derive cool new insights from your chatbot. Now that the scenario is set, let’s look at how you can monitor your LLM and generate insights for quick and seamless fine-tuning.
Challenges in Monitoring Large Language Models
LLMs can be quite a handful. They’re big, complex, and the moment you think you’ve got them figured out, they throw a curveball.
(This is true for both proprietary and open-source LLMs, as each type presents unique challenges and opportunities in monitoring and maintaining model integrity.)
You want them to write like Hemingway but sometimes they seem to channel a philosophy major writing their first essay. Here are some challenges you might face:
Scale: The sheer size of LLMs means they generate heaps of data. Monitoring that tsunami of information is no easy task.
Quality & accuracy: What’s a quality output? Shakespeare and modern-day texting are both English but hardly the same thing. Defining what’s “good” isn’t black and white, and the same goes for measuring your LLM’s accuracy.
Bias: These models can sometimes spew out biased or offensive content, and you want to catch that before it wreaks havoc.
From a bird’s eye view, these hurdles can stymie the unlocking of your LLM’s full prowess. Let’s delve into the underlying reasons for your language model’s shortcomings and explore how vigilant monitoring can be your catalyst in staying ahead of the game.
Improving LLM Accuracy: Reducing Hallucinations and Errors
LLMs, for all their brainy bits, have their blunders. Knowing when and where they’re goofing up is crucial. Let’s get into how you can keep tabs on hallucinations, bad responses, and funky prompts.
Monitoring hallucinations
Hallucinations in LLMs are when they start making things up. Not cool, right? Imagine your model pulling “facts” out of thin air! You need to keep a sharp eye on this. Set up an anomaly detection system that flags unusual patterns in the responses. You can also have a moderation layer that cross-checks facts with a reliable source. If your model claims that cats are plotting world domination, it’s probably time to rein it in.
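Here’s a minimal sketch of such a moderation layer. Note that `TRUSTED_FACTS` and `extract_claims` are hypothetical stand-ins: a production system would replace them with a retrieval-backed knowledge source and an information-extraction model.

```python
# A minimal sketch of a moderation layer that flags unverifiable claims.
# TRUSTED_FACTS and extract_claims are hypothetical placeholders.
TRUSTED_FACTS = {
    "water boils at 100 degrees celsius at sea level",
    "cats are domesticated animals",
}

def extract_claims(response: str) -> list[str]:
    # Placeholder: naive sentence split. A real system would use an
    # information-extraction model to pull out factual claims.
    return [s.strip().lower() for s in response.split(".") if s.strip()]

def flag_hallucinations(response: str) -> list[str]:
    """Return claims that cannot be verified against the trusted source."""
    return [c for c in extract_claims(response) if c not in TRUSTED_FACTS]

flags = flag_hallucinations(
    "Water boils at 100 degrees celsius at sea level. "
    "Cats are plotting world domination."
)
if flags:
    print("Unverified claims:", flags)
```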
Three reasons why LLMs hallucinate:
Training Bias – Biases and a lack of diversity in training data impede generalization, while the models’ complex architecture makes it hard for researchers to pinpoint and correct these issues.
Overfitting – Overfitting occurs when an LLM adapts too closely to its training data, resulting in a failure to generalize and potentially generating spurious, or “hallucinated,” content when faced with new, unseen data.
Bad Prompts – When input prompts are murky, conflicting, or self-contradictory, the stage is set for hallucinations to emerge. Users don’t have a say in data quality and training, but they hold the reins when it comes to input context. By sharpening their input game, users can nudge the results towards improvement.
Identifying bad responses
Now, you know that sometimes LLMs can whip up responses that nobody really wants. Monitoring user feedback can be a gold mine here. If users are reacting negatively or appear confused, take that as a sign.
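A minimal sketch of this kind of feedback monitor, with an illustrative window size and alert threshold you’d tune for your own traffic:

```python
from collections import deque

# Rolling window of the last 100 feedback signals (True = thumbs-up).
recent = deque(maxlen=100)
THRESHOLD = 0.2  # illustrative: alert if >20% of recent feedback is negative

def record_feedback(thumbs_up: bool) -> None:
    recent.append(thumbs_up)
    if len(recent) >= 20:  # wait for a minimal sample size
        negative_rate = 1 - sum(recent) / len(recent)
        if negative_rate > THRESHOLD:
            print(f"Alert: negative feedback rate at {negative_rate:.0%}")

# Example: 15 positive reactions followed by 10 negative ones trips the alert.
for thumbs_up in [True] * 15 + [False] * 10:
    record_feedback(thumbs_up)
```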
Keeping prompts in check
Prompts are like the breadcrumbs you give LLMs to follow. Sometimes, they take those crumbs and go off into a maze. To monitor this, keep an eye on how well the model’s responses align with the intent of the prompts. Misalignments can lead to responses that are out of place. You can do this by having a human-in-the-loop to validate a subset of responses or set up a system that scores alignment and flags any that are drifting off into the weeds.
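Here’s one way to sketch that alignment scoring, using sentence embeddings. This assumes the sentence-transformers package, and the 0.5 threshold is an illustrative value you’d tune against human-labeled examples:

```python
# A sketch of prompt-response alignment scoring via embedding similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def alignment_score(prompt: str, response: str) -> float:
    """Cosine similarity between prompt and response embeddings."""
    emb = model.encode([prompt, response], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

score = alignment_score(
    "Summarize our refund policy.",
    "Our refund policy allows returns within 30 days.",
)
if score < 0.5:  # illustrative threshold for flagging drift
    print(f"Possible misalignment (score={score:.2f})")
```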
Key Metrics to Track for Large Language Models
Let’s talk about the key metrics that you should be tracking:
Perplexity: This metric is akin to your LLM’s stress ball. It tells you how confused the model is with the tasks it’s trying to accomplish. Lower perplexity typically means the model is more confident in its outputs. (Not relevant to anyone using OpenAI’s ChatGPT, since it doesn’t expose the token probabilities you need.) How to calculate Perplexity: Perplexity = 2^(-(1/N) * Σ log2(p(x_i))), where N is the number of tokens and p(x_i) is the probability the model assigned to token i.
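A minimal sketch of the calculation, assuming you can obtain per-token probabilities from your model or API:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """2 raised to the average negative log2 probability per token."""
    n = len(token_probs)
    return 2 ** (-sum(math.log2(p) for p in token_probs) / n)

# Illustrative probabilities: high values mean a confident, low-perplexity model.
print(perplexity([0.9, 0.8, 0.95]))
```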
Token Efficiency: LLMs, particularly models like GPT, consume tokens for both input and output. Tokens can be as short as one character or as long as one word. Monitoring how efficiently an LLM uses tokens is essential for controlling costs and ensuring that the model doesn’t exceed its maximum token limit. How to calculate Token Efficiency: This is more of a practical metric, and there isn’t a standard formula for it. However, you can think of it in terms of tokens used relative to the total tokens available. For example, if a model has a maximum token limit of N and it uses M tokens for a given input-output pair, the token efficiency can be represented as the ratio: Token Efficiency = M/N
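As a sketch, you could count tokens with a tokenizer such as tiktoken (assumed here, along with an illustrative 4,096-token limit, which varies by model):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
MAX_TOKENS = 4096  # illustrative context limit; check your model's actual limit

def token_efficiency(prompt: str, response: str) -> float:
    """Fraction of the context window used by this input-output pair (M/N)."""
    used = len(enc.encode(prompt)) + len(enc.encode(response))
    return used / MAX_TOKENS

ratio = token_efficiency(
    "Summarize our refund policy.",
    "Returns are accepted within 30 days.",
)
print(f"Used {ratio:.1%} of the context window")
```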
Response Time: Be considerate of your users’ time. You don’t want your LLM to be that slowpoke who takes ages to respond. If it does, users might ditch it faster than you can say “cold cup of coffee”. How to calculate Response Time: This is usually measured in seconds or milliseconds and represents the time taken by the model to respond to a user’s request. It can be measured as the difference between the time when the response is received and the time when the request was sent. Response Time = Time(Response Received) – Time(Request Sent)
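A minimal sketch of the measurement, where `call_llm` is a hypothetical stub standing in for your actual model client:

```python
import time

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for your model client.
    time.sleep(0.1)
    return "stub response"

start = time.perf_counter()
response = call_llm("What's your refund policy?")
elapsed = time.perf_counter() - start  # Time(Response Received) - Time(Request Sent)
print(f"Response time: {elapsed * 1000:.0f} ms")
```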
Accuracy and F1 Score: You need to know how often your model is hitting the bullseye or missing the mark. Accuracy gives you an idea of how often your model is right, while the F1 Score provides a balance between precision (how many of the predicted positives are actually positive) and recall (how many actual positives were predicted correctly). (only relevant for certain AI tasks, for example using LLMs for classification.) How to calculate Accuracy: Accuracy = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives) How to calculate F1 Score: F1 Score = 2 * (Precision * Recall) / (Precision + Recall) Where Precision = True Positives / (True Positives + False Positives) and Recall = True Positives / (True Positives + False Negatives)
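If you’re using an LLM as a classifier, libraries like scikit-learn compute these formulas directly; the labels below are illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0]  # illustrative ground-truth labels
y_pred = [1, 0, 0, 1, 0]  # illustrative model predictions

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 Score: ", f1_score(y_true, y_pred))
```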
Throughput: When measuring the efficiency of your LLM, it is important to consider throughput – the number of requests or operations your model can handle per unit of time. This is particularly crucial in high-traffic environments where a high volume of requests needs to be processed swiftly. How to calculate Throughput: This is typically measured in requests per second (RPS) or operations per second and represents the number of requests or operations the model can handle in a unit of time. It can be calculated as: Throughput = Total Number of Requests / Total Time Taken
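A quick sketch with illustrative numbers, matching the formula above:

```python
# Throughput from a monitoring window: total requests over total time taken.
total_requests = 1200      # illustrative: requests served in the window
total_time_seconds = 60.0  # length of the observation window

throughput = total_requests / total_time_seconds
print(f"Throughput: {throughput:.1f} requests/sec")
```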
Drift: Changes in the data distribution over time can lead to a decrease in model performance, a phenomenon known as “drift”. Be sure to keep an eye on your model’s inputs and outputs for any changes. This can be crucial in the early detection of shifts in user behavior or environmental changes that your model needs to adapt to. How to calculate Drift: This is a complex metric to measure as it involves monitoring changes in the data distribution over time. There isn’t a single formula for drift, but it typically involves comparing the performance of the model on current data to its performance on historical data.
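One common approach (a sketch, not the only way) is a two-sample Kolmogorov-Smirnov test comparing a baseline window to current traffic on a numeric input feature, such as prompt length. This assumes scipy, and the data below is illustrative:

```python
from scipy.stats import ks_2samp

# Illustrative prompt lengths (in tokens) from two time windows.
baseline_prompt_lengths = [42, 55, 38, 61, 47, 50, 44]
current_prompt_lengths = [88, 95, 79, 102, 91, 85, 99]

stat, p_value = ks_2samp(baseline_prompt_lengths, current_prompt_lengths)
if p_value < 0.05:  # the distributions likely differ
    print(f"Possible input drift detected (p={p_value:.4f})")
```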
Fairness Metrics: LLMs are known to sometimes carry and perpetuate biases found in the data they were trained on. Organizations should monitor and quantify biases in model outputs using fairness metrics. This can be domain-specific and include metrics like gender bias, racial bias, or other forms of unintended bias. How to calculate Fairness: These metrics are often domain-specific and don’t have a one-size-fits-all formula. However, they generally involve comparing the performance or outcomes of the model across different demographic groups. For example, Gender Bias can be measured by comparing the model’s performance for male vs. female inputs.
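As a sketch, you might compare per-group accuracy on labeled traffic; the groups and outcomes below are illustrative:

```python
from collections import defaultdict

# Illustrative evaluation records: (demographic group, prediction correct?)
records = [
    ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False),
]

outcomes_by_group = defaultdict(list)
for group, correct in records:
    outcomes_by_group[group].append(correct)

for group, outcomes in outcomes_by_group.items():
    print(f"{group}: accuracy {sum(outcomes) / len(outcomes):.2f}")
# A large accuracy gap between groups is a signal of potential bias
# worth investigating further.
```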
The Checklist for LLMs in Production
While developers and organizations rush to implement LLMs into their products or create new products based on the GPTs of the world, using these models effectively requires all ML stakeholders to ensure the responsibility, accountability, and transparency of these models. Keep tabs on these fundamental tasks to ensure the accuracy and performance of your LLM-powered AI product.
Disclaimer: Some items on the checklist apply only when developing and deploying proprietary LLMs.
The Takeaway
Your LLM is like a talented but sometimes scatterbrained writer. By monitoring hallucinations, bad responses, and prompts, you can make sure your LLM stays on track and delivers the value you and your users are looking for. Make every word count.
Are you working with LLMs? Try out Aporia and see how LLM observability helps you keep track of your model performance, ensuring that every word counts, or chat with one of our LLM experts to learn more.