Machine learning models are only as good as the data they ingest during and after training. Data drift refers to a change in the distribution of a model’s input data over time. In other words, it refers to a situation where the input data that a machine learning model was trained on no longer accurately represents the data that the model is being applied to.
Data drift can have a significant impact on the performance of machine learning models, as a model that was trained on a different distribution of data may not be able to accurately predict or classify new data. This can cause a model to become less accurate over time, or even lead to the model’s performance degrading rapidly.
It’s important to keep an eye on the performance of the model over time and keep track of any changes in the input data, so that data drift can be identified and addressed as soon as possible.
This is part of an extensive series of guides about machine learning.
Concept drift is when the relationship between the inputs and outputs of a machine learning model changes in the real world, compared to the relationship that held when the model was trained. In other words, predictions the model generates for certain inputs, which used to be correct, are no longer valid.
For example, a model that was trained to detect fraudulent credit card transactions may become less accurate over time as criminals change their tactics. This is the most basic form of data drift.
Learn more in our detailed guide to concept drift
Covariate shift is similar to concept drift, but it is a more severe problem: not only does the relationship between inputs and outputs change, but the distribution of the input data itself changes as well.
For example, a model that was trained on data from a specific geographical region may become less accurate when applied to data from a different region due to different cultural influences or purchasing habits. The way the model needs to interpret its inputs has changed, and the inputs themselves are different too.
Prior probability shift occurs when the proportion of the different classes in the data changes over time. For example, if a binary classification model was trained to detect spam email, and the proportion of spam email in the population changes, the model’s performance may suffer because its prior probability assumptions are no longer accurate.
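One simple way to spot this kind of shift, not prescribed by this guide but a common statistical choice, is a chi-square goodness-of-fit test comparing class proportions in production against those seen at training time. A minimal sketch with hypothetical label counts:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical label counts: [spam, not spam]
train_counts = np.array([2_000, 8_000])   # 20% spam at training time
prod_counts = np.array([1_200, 1_800])    # 40% spam in production

# Scale the training proportions to the production sample size,
# since chisquare expects observed and expected totals to match.
expected = train_counts / train_counts.sum() * prod_counts.sum()
stat, p_value = chisquare(prod_counts, f_exp=expected)
print(f"chi2 = {stat:.1f}, p-value = {p_value:.2e}")
# A small p-value suggests the class proportions have shifted.
```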
The PSI is a measure of the change in the distribution of a feature between the training data and new data. It is calculated by binning the feature and comparing the proportion of observations that fall into each bin across the two datasets. A high PSI value indicates a significant change in the distribution of the feature, which may indicate data drift.
The formula to calculate PSI looks like this:
PSI = Σ ((Actual% − Expected%) × ln(Actual% / Expected%))

where the sum runs over the bins of the feature, Expected% is the share of training observations in a bin, and Actual% is the share of new observations in that bin.
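To make the calculation concrete, here is a minimal NumPy sketch (illustrative only, not Aporia’s implementation): it bins the feature using the training data, compares bin frequencies, and clips empty bins with an arbitrary small constant to keep the logarithm defined. The data is synthetic.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between an expected (training)
    and actual (production) sample of a single numeric feature."""
    # Derive bin edges from the training distribution, then widen
    # the outer edges so no production value falls outside a bin.
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip empty bins to avoid division by zero and log(0).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)   # training-time values
prod = rng.normal(loc=0.5, scale=1.0, size=10_000)    # drifted production values
print(f"PSI: {psi(train, prod):.3f}")
```

A common rule of thumb treats PSI below 0.1 as no significant change, 0.1 to 0.2 as moderate change, and above 0.2 as significant drift.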
The Kolmogorov-Smirnov test is a non-parametric test that can be used to determine whether two samples come from the same distribution. This test can be used to detect data drift by comparing the distribution of the training data and the distribution of the test data.
The formula looks like this:
Dn,m = supx |F1,n(x) − F2,m(x)|, where Fn(x) = (1/n) Σi=1..n I(Xi ≤ x) is the empirical distribution function of a sample.

F1,n(x) is the empirical distribution function of the previous data (n samples), F2,m(x) is that of the new data (m samples), and supx denotes the supremum over x: the test statistic is the largest absolute difference between the two functions at any point.
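In practice there is no need to implement the test by hand; SciPy’s two-sample KS test (scipy.stats.ks_2samp) computes both the statistic and a p-value. A brief sketch with synthetic data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)   # reference (training) sample
prod = rng.normal(loc=0.3, scale=1.2, size=5_000)    # current (production) sample

stat, p_value = ks_2samp(train, prod)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2e}")
# A small p-value rejects the hypothesis that the two samples
# come from the same distribution, i.e. drift is likely.
```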
KL divergence is a measure of the difference between two probability distributions. It can be used to detect data drift by comparing the distribution of the training data and the distribution of the test data.
Here is an example of the KL divergence formula with A and B representing the old and new data distributions, respectively:
KL(A||B) = Σx A(x) × log(A(x) / B(x))

The divergence ranges from 0 to infinity; a score of 0 means the two distributions are identical. Note that KL divergence is asymmetric: KL(A||B) generally differs from KL(B||A).
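Here is a minimal sketch of the computation over two pre-binned distributions; the probability vectors are made up for illustration, and the clipping constant is an arbitrary guard against empty bins (SciPy’s scipy.stats.entropy(a, b) computes the same quantity):

```python
import numpy as np

def kl_divergence(a, b):
    """KL(A||B) for two discrete distributions given as probability
    vectors over the same bins."""
    a = np.clip(np.asarray(a, dtype=float), 1e-12, None)  # avoid log(0)
    b = np.clip(np.asarray(b, dtype=float), 1e-12, None)  # avoid division by zero
    return np.sum(a * np.log(a / b))

old = np.array([0.1, 0.4, 0.5])   # binned training distribution
new = np.array([0.2, 0.3, 0.5])   # binned production distribution
print(f"KL(old||new) = {kl_divergence(old, new):.4f}")
print(f"KL(new||old) = {kl_divergence(new, old):.4f}")  # differs: KL is asymmetric
```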
JS divergence is a symmetric version of the KL divergence method, which can be used to detect the similarity or dissimilarity between two probability distributions. Following is the formula used in JS divergence:
JS(A||B) = (1/2) (KL(A||M) + KL(B||M)), where M = (1/2)(A + B) is the pointwise average (mixture) of the two distributions.
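SciPy ships a related helper, scipy.spatial.distance.jensenshannon, which returns the Jensen-Shannon distance, the square root of the divergence, so squaring it recovers JS divergence. A short sketch with the same illustrative distributions as above:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

old = np.array([0.1, 0.4, 0.5])   # binned training distribution
new = np.array([0.2, 0.3, 0.5])   # binned production distribution

# jensenshannon() returns the JS *distance*; square it for the divergence.
js_div = jensenshannon(old, new) ** 2
print(f"JS divergence: {js_div:.4f}")
print(f"Symmetric: {np.isclose(js_div, jensenshannon(new, old) ** 2)}")
```

Because JS divergence is symmetric and bounded (at most ln 2 with the natural logarithm), it is often easier to threshold for drift alerts than raw KL divergence.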
Learn more in our detailed guide to data drift detection (coming soon)
Here are a few strategies that can be used to address data drift:
By implementing these strategies, organizations can effectively address data drift and ensure that their machine learning models continue to perform well over time.
Learn more in our detailed guides to:
By identifying and addressing data drift early on, businesses can avoid the negative consequences of inaccurate predictions, such as lost revenue, reduced customer satisfaction, and increased operational costs. Thus, monitoring ML models for data drift is crucial for maintaining business continuity and maximizing the benefits of machine learning.
Aporia’s ML observability platform is the ideal partner for Data Scientists and ML engineers to visualize, monitor, explain, and improve ML models in production. Our platform fits naturally into your existing ML stack and integrates with your infrastructure in minutes. We empower organizations with key features and tools to ensure high model performance:
Root Cause Investigation
To get a hands-on feel for Aporia’s advanced model monitoring and deep visualization tools, we recommend:
Book a demo to get a guided tour of Aporia’s capabilities, see ML observability in action, and understand how we can help you achieve your ML goals.
Together with our content partners, we have authored in-depth guides on several other topics that can also be useful as you explore the world of machine learning.