Machine Learning Model Monitoring 101

Learn more about machine learning model monitoring and ML model management with our in-depth guide.


ML Model Monitoring

What is Machine Learning (ML) Model Monitoring?

ML monitoring is a set of techniques for observing ML models in production and ensuring the reliability of their performance. ML models train by observing examples from a dataset and minimizing an error term that quantifies how well the model performs the task it is training for. After training on a static set of examples in development, a production model performs inference on changing data from a changing world. This discrepancy between static training data and dynamic production data causes the performance of a production model to degrade over time.


Consider that you trained your model to detect credit card fraud based on pre-COVID user data. During a pandemic, credit card use and buying habits change. Such changes potentially expose your model to data from a distribution the model was not trained on. This is an example of data drift, one of several sources of model degradation. Without ML monitoring, your model will output incorrect predictions with no warning signs, which will negatively impact your customers and your organization in the long run. 

Machine learning model monitoring aims to use data science and statistical techniques to assess the quality of machine learning models in production continuously. 

Monitoring can serve different purposes: 

 1. Early detection of instabilities

 2. Understanding how and why model performance degrades

3. Diagnosing specific failure cases

Additionally, some ML monitoring platforms, like Aporia, can be used not only to track and evaluate model performance but also to investigate and debug issues, explain model predictions, and improve model performance in production.

How to Monitor Machine Learning

Engineers monitor software because systems built today are vulnerable to uncertainties in real-world deployment scenarios. ML models are software systems too, but by nature they are only as good as the data we feed them. Consequently, traditional software monitoring techniques are insufficient on their own when applied to ML models.

An effective ML monitoring system must detect changes in the data. Failing to detect these changes proactively can cause the model to fail silently. Such failures have a significant negative impact on business performance and on credibility with end users. Find out the 5 most common reasons your ML model may be underperforming in production.

Model monitoring can help maintain and improve the performance of an ML model in production, ensuring that the model performs as intended. A deployed ML model interacts with the real world. Therefore, the data the model sees in production is constantly changing. A model’s performance will often begin to degrade once it’s deployed to production. 

Monitoring performance degradation will help you quickly detect when a model is underperforming. The performance metrics are specific to the model and learning task. For example, accuracy, precision, and F1-score would be used for classification tasks, while root-mean-squared error would be used for regression tasks. Beyond observing performance metrics on real-world data, a data science team can perform checks on the input data to gain further insight into performance degradation.
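As a sketch of this kind of check, the snippet below compares classification metrics on a labeled production window against values recorded at training time. The baseline numbers, the sample window, and the 5% tolerance are illustrative assumptions, not recommendations:

```python
# Compare a model's live metrics on a labeled production window
# against the metrics recorded at training time.
from sklearn.metrics import accuracy_score, precision_score, f1_score

def evaluate_window(y_true, y_pred, baseline, tolerance=0.05):
    """Return per-metric results and whether each dropped beyond tolerance."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    alerts = {name: baseline[name] - value > tolerance
              for name, value in metrics.items()}
    return metrics, alerts

# Baseline: metrics measured on the test set at training time (placeholder values)
baseline = {"accuracy": 0.95, "precision": 0.93, "f1": 0.94}
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # labels collected from production
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # the model's predictions on that window
metrics, alerts = evaluate_window(y_true, y_pred, baseline)
```

In a real system the window would be a rolling batch of recent predictions joined with their delayed ground-truth labels.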

Furthermore, measuring model drift is an essential component of an ML monitoring system. Model inputs, outputs, and actuals are subject to drift over time, measured as a change in distribution. Check your models for drift to determine if they are stale, if you have data quality issues, or if they include adversarial inputs. You can better understand how to resolve these issues by detecting drift with ML monitoring.

A comprehensive model monitoring solution should include:

1. Data drift detection: Keeping track of the distribution of each input feature can help reveal changes in the input data over time. You can extend this tracking to joint distributions.

2. Data integrity detection: To detect changes in the input data structure, check that feature names are the same as those in your training set. Scanning the input for missing values will reveal changes or issues in data gathering pipelines.

3. Concept drift detection: Knowing the importance of each feature in your input data relative to the output is a simple yet effective guard against concept drift. Variations in feature relevance are an indication of concept drift. We can evaluate the model on specific features or perform correlation studies over time. These techniques also help understand changes in model performance. For example, you might discover a relationship between the significance of some features and the time of year.
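A minimal sketch of the integrity checks in item 2, assuming a pandas-based pipeline; the column names and the batch contents are illustrative:

```python
# Verify that a production batch has the same columns as the training data
# and flag missing values introduced by upstream pipeline changes.
import pandas as pd

def check_integrity(train_cols, batch: pd.DataFrame):
    """Return a dict describing schema mismatches and null values, if any."""
    issues = {}
    missing_cols = set(train_cols) - set(batch.columns)
    extra_cols = set(batch.columns) - set(train_cols)
    if missing_cols:
        issues["missing_columns"] = sorted(missing_cols)
    if extra_cols:
        issues["unexpected_columns"] = sorted(extra_cols)
    null_counts = batch.isna().sum()
    nulls = null_counts[null_counts > 0]
    if not nulls.empty:
        issues["null_values"] = nulls.to_dict()
    return issues

train_cols = ["sqft", "bedrooms", "neighborhood_income"]
batch = pd.DataFrame({"sqft": [1200, None], "bedrooms": [3, 2],
                      "zip": ["10001", "10002"]})
issues = check_integrity(train_cols, batch)
```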

Checking the input data establishes a short feedback loop to quickly detect when the production model starts underperforming.

In addition to performance degradation, a production model may underperform due to data bias or anomalies.

1. Data bias: Training your model on biased data will carry that bias into production. Consider training a model to classify images of cats and dogs. If your training set has disproportionately more cat images than dog images, your model may achieve good accuracy simply by classifying most images as cats rather than learning an actual boundary between cats and dogs. To prevent biased model outputs in production, we can analyze the training data for imbalanced representations or skews in the target variable and input features.

2. Anomalies: Anomalies are input samples that are outliers relative to the distribution of the training samples. Running inference on an outlier nearly guarantees an inaccurate result. To prevent poor model performance due to anomalies, we can first evaluate each input sample to ensure it belongs to the distribution of our training data.
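To illustrate the anomaly check described above, here is a minimal sketch that flags an input sample whose features fall far outside the training distribution. The 3-sigma cutoff is a common heuristic rather than a universal rule, and the data is synthetic:

```python
# Flag input samples that are outliers relative to the training distribution.
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=5.0, size=(1000, 3))  # stand-in training set
mu, sigma = train.mean(axis=0), train.std(axis=0)        # per-feature statistics

def is_anomalous(sample, mu, sigma, z_cutoff=3.0):
    """True if any feature of the sample is more than z_cutoff sigmas from the mean."""
    z = np.abs((sample - mu) / sigma)
    return bool((z > z_cutoff).any())

typical = np.array([51.0, 49.0, 50.5])   # well within the training distribution
outlier = np.array([51.0, 49.0, 95.0])   # third feature roughly 9 sigmas away
```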

Drift Detection in ML Models

How to Detect Model Drift in ML Models

An obvious way to detect model drift is through ML monitoring techniques and solutions like Aporia, which ensure that the model performance is not degrading beyond a certain point. As data drift and concept drift are the primary sources of model drift, it is necessary to have the ability to detect data and concept drift.

How to Detect Data Drift in ML Models

Data drift occurs due to changes in your input data. Therefore, to detect data drift, you must observe your model’s input data in production and compare it to your training data. If the production input data and the training data do not have the same format or distribution, you are likely experiencing data drift. For example, in the case of changes in data format, consider that you trained a model for house price prediction. In production, ensure that the input matrix has the same columns as the data you used during training. Changes in the distribution of the input data relative to the training data, on the other hand, require statistical techniques to detect. The following tests can be used to detect changes in the distribution of the input data:

1. Kolmogorov-Smirnov (K-S) test: You can use the K-S test to compare the distribution of your training set to your inputs in production. The test’s null hypothesis is that both samples come from the same distribution; rejecting it indicates the distributions differ, a sign of data drift. Learn more about this detection method and others in our guide on Concept Drift Detection Methods.

2. Population Stability Index (PSI): The PSI of a random variable measures the change in the variable’s distribution over time. In the example of the house price prediction system, you can measure the PSI on features of interest, such as square footage or average neighborhood income, to observe how the distributions of those features change over time. Large changes may indicate data drift.

3. Z-score: The z-score can compare the distribution of features between the training data and the production data. If the absolute value of the calculated z-score is high, you may be experiencing data drift.
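A sketch of the first two tests above, using SciPy’s two-sample K-S test and a simple PSI implementation over quantile bins. The drift thresholds (p < 0.05, PSI > 0.2) are conventional heuristics rather than hard rules, and the data is synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=5000)  # training feature values
prod = rng.normal(loc=0.5, scale=1.2, size=5000)   # shifted production values

# Kolmogorov-Smirnov: a small p-value means the distributions differ.
stat, p_value = ks_2samp(train, prod)
ks_drift = p_value < 0.05

def psi(expected, actual, bins=10):
    """Population Stability Index over quantile bins of the expected data."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf     # catch values outside the training range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)      # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

psi_value = psi(train, prod)
psi_drift = psi_value > 0.2  # values above ~0.2 are often read as significant drift
```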

How to Detect Concept Drift in ML Models

You can detect concept drift by detecting changes in the model’s prediction probabilities for similar inputs. A shift in your model’s outputs on production data can indicate that the relationship between inputs and target has changed, even if the inputs themselves look familiar. For example, if your house price prediction model does not account for inflation, it will start underestimating house prices. You can also detect concept drift through ML monitoring techniques such as performance monitoring: a drop in your model’s accuracy or classification confidence could indicate concept drift.
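One lightweight version of this idea is to track the model’s predicted probabilities over time and flag a shift in their average. The score streams below are synthetic stand-ins for logged model outputs, and the 0.1 threshold is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
# Predicted probabilities logged at deployment time vs. a later window.
reference_scores = rng.beta(2, 5, size=2000)  # mean around 0.29
current_scores = rng.beta(5, 2, size=2000)    # shifted upward, mean around 0.71

def mean_shift(reference, current, threshold=0.1):
    """Flag when the average predicted probability moves by more than threshold."""
    return abs(float(np.mean(current)) - float(np.mean(reference))) > threshold

output_shifted = mean_shift(reference_scores, current_scores)
```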


How to Prevent Concept Drift in ML Models

You can prevent concept drift through ML model monitoring. ML monitoring will reveal degradation in model performance that could indicate concept drift, thus prompting ML developers to update the model. 

In addition to this observation-based prevention method, you can leverage a time-based approach, where the ML model is periodically retrained given a degradation timeframe. For example, if the model’s performance becomes unacceptable every four months, retrain every three months. 

Finally, you can prevent concept drift with online learning. In online learning, your model will train every time new data is available, instead of waiting to accumulate a large dataset and then retraining the model.
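A minimal online-learning sketch using scikit-learn’s SGDClassifier, whose partial_fit method updates the model incrementally as each new mini-batch arrives; the data stream here is synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(7)
model = SGDClassifier(random_state=7)
classes = np.array([0, 1])  # all classes must be declared for partial_fit

for step in range(20):  # each iteration simulates a newly arrived mini-batch
    X = rng.normal(size=(64, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic, linearly separable labels
    model.partial_fit(X, y, classes=classes)  # incremental update, no full retrain

X_test = rng.normal(size=(200, 4))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
accuracy = model.score(X_test, y_test)
```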

ML Performance Monitoring

How to Monitor ML Performance

Performance monitoring helps us detect that a production ML model is underperforming and understand why it is underperforming. Monitoring ML performance often includes monitoring model activity, metric change, model staleness (or freshness), and performance degradation. The insights gained through ML performance monitoring will advise changes to make to improve performance, such as hyperparameter tuning, transfer learning, model retraining, developing a new model, and more. 

Monitoring performance depends on the model‘s task. An image classification model would use accuracy as its performance metric, while mean squared error (MSE) is better suited to a regression model. It is important to understand that a single poor measurement does not mean the model’s performance is degrading. For example, MSE is sensitive to outliers, so a batch containing an outlier can score poorly even though the model itself has not gotten worse; the drop is an artifact of the metric and the input data. Evaluating the input data is a good ML performance monitoring practice and sheds light on such instances of apparent degradation.
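The MSE artifact described above is easy to demonstrate: the same predictions score dramatically worse once a single outlier enters the batch, even though the model has not changed. The numbers are illustrative:

```python
import numpy as np

# A regression model's predictions on a typical batch.
y_true = np.array([3.0, 2.5, 4.0, 3.5])
y_pred = np.array([2.9, 2.6, 3.8, 3.6])
mse_clean = float(np.mean((y_true - y_pred) ** 2))

# The same batch plus one extreme outlier target.
y_true_out = np.append(y_true, 40.0)  # outlier ground truth
y_pred_out = np.append(y_pred, 4.0)   # the model predicts a typical value
mse_outlier = float(np.mean((y_true_out - y_pred_out) ** 2))
```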

In monitoring the performance of an ML model, we need to clearly define what bad performance is. This typically means specifying an accuracy score or error as the expected value and observing any deviation from the expected performance over time. In practice, data scientists understand that a model is unlikely to perform as well on real-world data as it did on the test data used during development. Additionally, real-world data is very likely to change over time. For these reasons, we can expect and tolerate some level of performance decay once the model is deployed. To this end, we use an upper and lower bound for the expected performance of the model. The data science team should choose the parameters that define expected performance carefully, in collaboration with subject matter experts.
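The bounded-expectation idea above can be reduced to a simple status check; the band values are placeholders that a team would set with subject matter experts:

```python
def performance_status(metric, lower, upper):
    """Classify a metric reading against an agreed performance band."""
    if metric < lower:
        return "alert"        # degraded beyond the tolerated decay
    if metric > upper:
        return "investigate"  # suspiciously good; check for leakage or label issues
    return "ok"

status = performance_status(0.90, lower=0.88, upper=0.97)
```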

Performance decay has very different consequences depending on the use case. The level of performance decay acceptable thus depends on the application of the model. For example, we may tolerate a 3% accuracy decrease on an animal sound classification app, but a 3% accuracy decrease would be unacceptable for a brain tumor detection system. 

ML performance monitoring is a valuable tool to detect when a production model is underperforming and what we can do to improve. To remediate issues in an underperforming model, it is helpful to:

1. Keep data preprocessing and the ML model in separate modules. Keeping data preprocessing and the ML model as separate modules helps you fix a degrading model more efficiently when changes to the preprocessing pipeline are sufficient. Consider that you built a model that performs handwriting classification on mail in a US post office. In production, the post office decides to get lower intensity light bulbs for energy savings. Your model is now performing on much darker images. In this case, changing the data preprocessing module to increase pixel intensities and enhance borders is enough to improve model performance. It is also significantly less expensive and time-consuming than retraining the model.

2. Use a baseline. A baseline model is a simpler, more interpretable model that gets good results. You use it as a sanity check for your more complex production model. For example, the baseline for an LSTM on time-series data could be a logistic regression model. Observing a decrease in performance in your production model while the baseline model still performs well could indicate that your production model overfits the training data. In this case, tweaking the regularization hyperparameters may improve model performance. Without the baseline model, you might wrongly conclude that the model is underperforming due to data or concept drift and retrain or build a new model.

3. Choose a model architecture that is easily retrainable. Neural networks are powerful ML algorithms because of their ability to approximate complex functions. Furthermore, they are particularly well suited for production because it is possible to retrain only parts of a neural network. For example, an image classification model that encounters images from new classes does not require complete end-to-end retraining. Instead, we can use transfer learning: retrain only the classification layers of the network on the additional classes and redeploy.
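The baseline comparison in item 2 can be sketched with scikit-learn; the dataset and both models are illustrative stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)    # simple, interpretable
complex_model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
complex_acc = complex_model.score(X_te, y_te)
# If complex_acc falls well below baseline_acc in production, suspect
# overfitting or a model-specific failure rather than drift alone.
```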

To gain further insight from monitoring model performance, it is useful to visualize the production input data relative to the training data and detect anomalies as described in the section “How to Monitor Machine Learning” above.

How to Improve Model Performance

Even if concept drift and data drift are brought under control, ML model performance may still decline over time. To combat this, data scientists need to regularly retrain ML models on new and updated data, and look for ways to improve the models themselves.

The following are some techniques you can use to improve model performance:

1. Use a more sophisticated tool: Better tools may offer more features to improve ML model performance, but the time needed to implement these new ML model tools into already-existing ML systems must be considered.

2. Use more data: increasing the amount of data used to train a model will help the model generalize better, thus remaining relevant for longer. This solution can become impractical if the ML system requires large amounts of data to train. 

3. Use ML model ensemble methods: Ensembles are known to improve ML models’ performance, as the ensemble predicts the most likely label based on the predictions of several different models. Ensembles can help ML systems withstand concept drift because if one model in the ensemble experiences drift, its contribution to the ensemble’s prediction is overshadowed by the other models. This method comes at the cost of maintaining the ensemble itself; ensembles need to be monitored carefully so they do not cause more harm than the performance improvement is worth.

4. Use ML models with higher predictive power: Those who want to build ML systems that hold up under concept drift and data drift may consider using generally more powerful ML models, such as random forests or generalized linear models (GLMs). ML model ensembles can also be built from such high-performance models. Feature selection can also be considered as a way to improve model performance, though concept drift can cause this method to fail, leading ML developers to fall back on more sophisticated ML model algorithms.
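Technique 3 can be sketched with scikit-learn’s VotingClassifier, where a soft vote averages predicted probabilities across members so a single drifting model is dampened by the others; the member models and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=1)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted probabilities across members
)
ensemble.fit(X, y)
train_acc = ensemble.score(X, y)
```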

ML Model Management

What is ML Model Management?

Model management is a subset of MLOps that focuses on experiment tracking, model versioning, deployment, and monitoring. When developing an ML model, data scientists often perform several experiments to find the champion model. These experiments include changes in data preprocessing, hyperparameter tuning, and the model architecture itself. The purpose of these experiments is to find the best model for a particular use case. Data scientists often cannot tell whether a given configuration is optimal until later experiments with sub-optimal configurations put it in context. Thus, keeping track of experiments is crucial to developing ML models.

In a typical scenario, model development is a collaborative effort. Data scientists often use existing notebooks from peers as starting points for their own experiments. This collaboration increases the difficulty of reproducing desired results.

Model management addresses these challenges in the following ways:

  • Tracking metrics, losses, code, and data versions to facilitate experiment replicability


  • Enabling reusability through the delivery of models in repeatable configurations


  • Ensuring compliance with changes in business and regulatory requirements


Version control systems are used within ML model management but only provide part of the necessary capabilities. A version control system only keeps track of changes in the system’s source code over time. A practical ML model management framework must also leverage the following:

  • ML Model Monitoring: A system that gives visibility into ML models in production and enables detection of issues like data drift, unexpected bias, data integrity problems, and more that impact a model’s predictions and performance.


  • Explainability: The ability to understand the relationship between features in the input data and the model’s predictions.


  • Data Versioning System: Data Versioning keeps track of changes made to the dataset for experimenting, training, and deploying. Causes for different data versions include changes in data preprocessing and changes in the data source. For more information about data versioning, read our Best Data Versioning Tools for MLOps post.


  • Experiment Tracking: The experiment tracker records the results of each training or validation experiment as well as the configuration that produced these results. The recorded configurations include hyperparameters such as learning rate, batch size, or regularization terms, to name a few.


  • Model Registry: A registry of all models in deployment.


Building machine learning systems for production is a craft apart from developing ML models in research. As such, production ML requires its own set of tools and practices for successfully delivering solutions at scale. Integrating ML management from the start ensures that you are using the right tools for the job. 

See a hands-on tutorial on Building an ML Platform from Scratch.

Why Manage & Monitor Your ML Models After Deployment

Deployed models are exposed to real-world data that is constantly changing. ML management after deploying a model is therefore critical to ensure that the model continues to perform as desired. One of the subsets of ML management is ML monitoring, a set of tools to observe the quality and performance of an ML model in production. Having an ML management framework for deployed models helps teams track performance metrics, monitor changes in the data, and gain valuable insights on why a model is underperforming, which will inform changes to improve performance. For example, visualizing input data in production relative to the data the model was trained on can reveal data drift, prompting your team to retrain the deployed model on more up-to-date data.

ML management can also help you keep track of all your models in deployment. ML management includes keeping a model registry of all deployed models and using a model versioning system. The model registry and versioning coupled with performance monitoring provide a convenient global health dashboard of your ML models in production. With a model registry and versioning system, teams can better pinpoint which peculiarities cause a given model version to underperform in some settings. This makes improving the deployed model more efficient.

Finally, managing ML models after deployment will help keep track of degrading models in production and better schedule diagnostic tests to gain further insights into bad performance.


Explainability (XAI)

What is Explainability for Machine Learning?

Making an ML model explainable is about developing the ability to understand the relationship between features in the input data and the model’s predictions. ML models often employ architectures with many thousands of learnable parameters that estimate a complex function. This makes it difficult to describe what goes on inside the model to produce its outputs. This issue has earned ML models the title of “black box”.

Ensuring that an ML model is explainable is complicated in the real world because:

  • The interpretation of an algorithm’s result depends on how much data you have available
  • There are many ways that a machine learning algorithm can be wrong

We measure a model’s explainability by looking at a few different aspects:

1. Whether the decision-making process is explainable

2. How accurate a model is in predicting an outcome (i.e. its accuracy)

3. How reliable are a classifier’s decisions

Trying to understand what’s gone wrong with a machine learning algorithm requires much investigation, which can be challenging. In particular, if there are biases present in the data used to train a model, we can’t tell whether the biases were a result of a mistake in training or if they are simply due to a flaw inherent in our data.

Making ML models explainable is critical in preventing model drift in production as it removes a lot of the guesswork involved in troubleshooting an underperforming model.

For a hands-on guide to achieving Explainable AI, see Aporia’s documentation on Explainability.

ML Experiment Tracking

What is ML Experiment Tracking?

ML experiment tracking is the process of saving all experiment results and configurations to enable the repeatability of those experiments. 

ML researchers run several experiments to find the best model and it is hard to keep track of all the experiments and their associated results. To find a champion model, ML researchers run several experiments with various datasets, hyperparameters, model architectures, package versions, etc. 

Experiment tracking is important because it will help you and your team:

1. Organize all ML experiments in a single place. You may run experiments on your local machine while a teammate runs theirs in the cloud or in Google Colab. An ML experiment tracking system will log experiment metadata and results from any system or machine.

2. Compare and analyze experiment results. An experiment tracking system ensures that all experiments are logged using the same format, making it possible to compare different experiment configurations and results at no extra cost.

3. Enhance collaboration with your team. The experiment tracking system keeps a log of who ran each experiment. All team members can see what was already tried by other members. They can also pull an experiment run by someone else, reproduce it, and continue building from there.

4. Watch your experiments run in real time. The experiment tracking system makes it simple to start an experiment and watch it run remotely from a dashboard. You will be able to see metrics such as loss, epoch time, and CPU/GPU usage while the experiment is running. This is particularly useful when running your experiment in an environment that makes visualization hard, such as on a remote machine in the cloud.

To effectively track ML experiments, you need to keep track of:
  • Code: This includes scripts and/or notebooks used to run the experiment
  • Environment: Environment configuration files.
  • Data: Use data versioning to keep track of data versions used in your experiments
  • Parameters: Parameter configurations include hyperparameters for the model itself, such as learning rate, but also any editable options for your experiments; for example, the number of threads used by your data loader
  • Metrics: Train, validation, and test losses are examples of general metrics to track. You can track metrics specific to the model you are training. For example, you might also want to track gradient norms when working with a deep neural network.
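Taken together, the items above amount to a structured record per run. Here is a toy sketch of such a record written to local JSON; real trackers such as MLflow or Weights & Biases persist this to a server, and the commit hash and dataset version below are placeholders:

```python
import json
import platform
import time
import uuid
from pathlib import Path

run = {
    "run_id": uuid.uuid4().hex,
    "timestamp": time.time(),
    "code": {"script": "train.py", "git_commit": "<commit-hash>"},  # placeholders
    "environment": {"python": platform.python_version()},
    "data": {"dataset_version": "v3"},                              # placeholder
    "parameters": {"learning_rate": 1e-3, "batch_size": 32},
    "metrics": {"train_loss": 0.21, "val_loss": 0.34},
}

path = Path("runs") / f"{run['run_id']}.json"
path.parent.mkdir(exist_ok=True)
path.write_text(json.dumps(run, indent=2))
loaded = json.loads(path.read_text())  # any teammate can reload and compare runs
```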

ML Model Monitoring vs. Tracking

The fundamental difference between ML model monitoring and ML experiment tracking is that model monitoring is mostly relevant after a model is deployed in production. In contrast, tracking is most relevant before deployment. 

We implement ML monitoring to maintain and improve model performance once the model is in production. Once a model is in production, we can monitor it to observe performance metrics on real-world data and notice performance degradation when it happens.

On the other hand, ML experiment tracking deals with operationalizing the research and development of ML systems before they go into production. ML tracking helps researchers track code, environment configurations, data versions, parameters, and metrics for all experiments run during the development cycle of an ML model to find the optimal configuration. ML tracking is also relevant in settings where only research is conducted without deployment, such as work toward a research paper.

ML Model Registry

What is a Model Registry?

A model registry is a repository of all models in production. It provides a central point of access to all trained and available ML models. The purpose of this approach is to improve model reusability by providing a uniform way of accessing, searching, and managing every deployed model. Ecosystems related to ML (e.g., OpenML, ModelZoo, and the MODL-Wiki) are examples of community efforts for developing such a model registry.

An important aspect of a model registry is that all models are stored in one central place, which means that everyone views the same models. People collaborating on a project have a single reference to each model. The model registry bypasses issues with slightly different versions on local machines.

The model registry makes it easier to collaborate on ML projects by:

1. Connecting experiment and production lifecycles: The model registry provides a standardized way of taking a model from the development lifecycles and staging it for deployment in production. The model registry facilitates interactions between researchers and MLOps engineers through continuous integration, delivery, and training (CI/CD/CT) of ML models.

2. Presenting a central dashboard for teams to work on models. A centralized place to access models makes it easy for teams to search for models and check models’ status, such as staged, deployed, or retired. From the central dashboard, teams can also reference training and experiment results through the experiment tracker and view the model’s live performance in production through ML monitoring.

3. Surfacing an interface for other systems to consume models. A model registry can provide an API for integrating with other applications or systems, making it possible to serve an ML model to third-party client applications. A client application can pull the latest version of a model and automatically stay up to date on changes made to the model, for example after retraining to address degradation.
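The three roles above can be sketched as a toy in-memory registry; real registries (e.g. the MLflow Model Registry) add persistent storage, access control, and CI/CD hooks, and the model name and artifact URIs here are made up:

```python
class ModelRegistry:
    """Toy registry: register model versions, track stage, fetch the latest."""

    def __init__(self):
        self._models = {}  # name -> list of {"version", "stage", "artifact"}

    def register(self, name, artifact, stage="staged"):
        versions = self._models.setdefault(name, [])
        entry = {"version": len(versions) + 1, "stage": stage, "artifact": artifact}
        versions.append(entry)
        return entry["version"]

    def promote(self, name, version, stage="deployed"):
        self._models[name][version - 1]["stage"] = stage

    def latest(self, name, stage=None):
        versions = self._models[name]
        if stage is not None:
            versions = [v for v in versions if v["stage"] == stage]
        return versions[-1] if versions else None

registry = ModelRegistry()
registry.register("fraud-detector", artifact="s3://models/fraud/v1")
v2 = registry.register("fraud-detector", artifact="s3://models/fraud/v2")
registry.promote("fraud-detector", v2)                       # stage the new version
current = registry.latest("fraud-detector", stage="deployed")  # what clients pull
```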

Start Monitoring Your Models in Minutes