Tom Alon

# Ultimate Guide to PR-AUC: Calculations, uses, and limitations

Understanding evaluation metrics is a crucial aspect of creating effective machine learning models. One such metric is the Precision-Recall AUC (Area Under the Curve). This guide will dive into what this metric is, why we use it, how to calculate it, when to use it, and its challenges. Let’s get started.

If you’re in need of a quick reminder, we’ve previously covered Precision and Recall on their own.

## What is Precision-Recall AUC?

The Precision-Recall AUC (PR AUC) is an evaluation metric used mainly for binary and multilabel classification problems. It is the area under the Precision-Recall curve, which plots Precision (the proportion of positive predictions that are actually positive) against Recall (the proportion of actual positives that the model identifies) at various threshold settings.

In simpler terms, the PR AUC quantifies how well a model can distinguish between classes, considering both its ability to not mark a negative sample as positive (Precision) and its ability to find all the positive samples (Recall). A higher PR AUC value signifies a better-performing model.
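To make the two definitions concrete, here is a minimal sketch on a hand-made toy dataset. The labels, scores, and both thresholds are invented for illustration. It uses scikit-learn's `precision_score` and `recall_score` to show how lowering the decision threshold trades Precision for Recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy data, invented for illustration: 4 positives followed by 6 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.7, 0.45, 0.35, 0.2, 0.1, 0.05])

def precision_recall_at(threshold):
    """Classify scores >= threshold as positive and return (precision, recall)."""
    y_pred = (y_score >= threshold).astype(int)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

p_high, r_high = precision_recall_at(0.5)   # strict threshold
p_low, r_low = precision_recall_at(0.25)    # lenient threshold

print(f"threshold 0.50: precision={p_high:.3f}, recall={r_high:.3f}")
print(f"threshold 0.25: precision={p_low:.3f}, recall={r_low:.3f}")
```

At the strict threshold the model finds half of the positives, and two of its three positive predictions are correct. At the lenient threshold it finds all of them, but a larger share of its positive predictions are wrong. The Precision-Recall curve traces this trade-off across every threshold.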

## Why do we need PR AUC?

PR AUC is often used when dealing with imbalanced datasets – a situation where the number of observations is not evenly distributed between the target classes. In these cases, metrics like accuracy can be misleading.

For instance, in a dataset with 95% negatives and 5% positives, a model that always predicts negative will have an accuracy of 95%. However, this model is useless for predicting positive instances. PR AUC, on the other hand, does a great job of addressing this imbalance by focusing on the rare class (positive class in this case).

Accuracy looks strong here only because it is dominated by the majority class. PR AUC, by contrast, ignores true negatives altogether: Precision and Recall are both computed from true positives, false positives, and false negatives, so the score reflects only how well the model handles the positive class, which is the minority in this case. It accounts both for the model's ability to find positive instances (Recall) and for its ability to avoid labeling negative samples as positive (Precision).

So, in scenarios with imbalanced classes where the positive class is of interest, PR AUC can provide a more informative and nuanced evaluation of model performance compared to standard accuracy.
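The 95%/5% scenario can be reproduced in a few lines. This is a sketch under stated assumptions: the "always negative" model is represented by a constant score of 0, and PR AUC is computed with scikit-learn's `average_precision_score`, a step-wise summary of the Precision-Recall curve (a choice made for this illustration, since a constant score yields only a single point on the curve).

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# 95 negatives and 5 positives, matching the 95%/5% split in the text.
y_true = np.array([0] * 95 + [1] * 5)

# A model that always predicts "negative": hard labels and a constant score of 0.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true))

accuracy = accuracy_score(y_true, y_pred)
pr_auc = average_precision_score(y_true, y_score)

print(f"Accuracy: {accuracy:.2f}")
print(f"PR AUC (average precision): {pr_auc:.2f}")
```

Accuracy comes out at 0.95, while the PR AUC is only 0.05, the share of positives in the data, which exposes that the model has learned nothing about the class of interest.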

## How to calculate PR AUC?

Calculating the PR AUC involves multiple steps:

**Predict Probabilities**: First, your model needs to predict probabilities for the positive class.

**Compute Precision and Recall**: Treat each unique predicted probability as a decision threshold: samples scoring at or above it are classified as positive, and the rest as negative. Each threshold yields one pair of Precision and Recall values.

**Plot Precision-Recall Curve**: Plot Precision (y-axis) against Recall (x-axis).

**Calculate Area**: Finally, compute the area under this curve, which is the PR AUC.
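The steps above can be sketched by hand with NumPy. This is an illustrative implementation, not the routine any particular library uses: the helper names `pr_curve_points` and `step_area` are hypothetical, the toy labels and scores are invented, and the area is computed as a step-wise sum (precision weighted by the increase in recall), which is one of several ways to integrate the curve.

```python
import numpy as np

def pr_curve_points(y_true, y_score):
    """Return (threshold, precision, recall) for every unique score, highest threshold first."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_positive = y_true.sum()
    points = []
    for threshold in np.unique(y_score)[::-1]:
        predicted_positive = y_score >= threshold
        tp = np.sum(predicted_positive & (y_true == 1))
        fp = np.sum(predicted_positive & (y_true == 0))
        points.append((float(threshold), float(tp / (tp + fp)), float(tp / n_positive)))
    return points

def step_area(points):
    """Sum precision * (increase in recall) over thresholds: a step-wise area estimate."""
    area, previous_recall = 0.0, 0.0
    for _, precision, recall in points:
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area

# Toy data, invented for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

points = pr_curve_points(y_true, y_score)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
print(f"PR AUC (step-wise): {step_area(points):.4f}")
```

Only thresholds at which recall increases contribute to the sum, so lowering the threshold past a negative sample changes precision but adds no area by itself.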

Most modern machine learning libraries, such as Scikit-Learn in Python, provide functions to calculate PR AUC directly.

**Example**

This example assumes you have a binary classification model and a dataset to test it on.

```
from sklearn.metrics import precision_recall_curve, auc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# Step 1: Generate a random binary classification dataset.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10, random_state=42)
# Step 2: Split the dataset into training and test datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Fit a model to the data.
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
# Step 4: Predict probabilities for the test dataset.
y_scores = model.predict_proba(X_test)[:, 1]
# Step 5: Compute Precision and Recall for different thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
# Step 6: Calculate Area Under the PR curve.
pr_auc = auc(recall, precision)
# Print the PR AUC
print(f'PR AUC: {pr_auc}')
# Step 7: Plot the Precision-Recall curve.
plt.plot(recall, precision, marker='.', label='Logistic')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.legend()
plt.show()
```

This script first generates a random binary classification problem, splits it into training and test sets, and fits a logistic regression model to the data. It then predicts class probabilities for the test set and computes precision and recall for various decision thresholds. Finally, it computes and prints the area under the precision-recall curve (PR AUC), and plots the precision-recall curve.
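The script integrates the curve with `auc`, which applies the trapezoidal rule. scikit-learn also provides `average_precision_score`, which summarizes the same curve as a step-wise sum without interpolating between points. The two estimates can differ, particularly on small or highly imbalanced test sets, so it helps to state which one a reported PR AUC refers to. A minimal comparison on invented toy data:

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Tiny hand-made example, invented for illustration.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

precision, recall, _ = precision_recall_curve(y_true, y_scores)

# Trapezoidal rule: linearly interpolates between neighbouring points on the curve.
pr_auc_trapezoid = auc(recall, precision)

# Average precision: precision at each threshold weighted by the increase in recall.
pr_auc_step = average_precision_score(y_true, y_scores)

print(f"Trapezoidal PR AUC: {pr_auc_trapezoid:.4f}")
print(f"Average precision:  {pr_auc_step:.4f}")
```

On this data the trapezoidal estimate is about 0.79 and the average precision about 0.83. Neither is wrong; they are different summaries of the same curve, so comparisons between models should use one of them consistently.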

## When to use PR AUC?

As mentioned before, PR AUC is especially useful when dealing with imbalanced datasets. It’s ideal for situations where the positive class is of more interest, and false positives and false negatives have a high cost. Examples include fraud detection, disease diagnosis, and churn prediction, where the positive class (fraud, disease, churn) is often the minority class.

Let’s consider a use case where we want to predict credit card fraud, which is indeed a highly imbalanced problem because the number of genuine transactions is far more than the number of fraudulent transactions.

We’ll use Python and its libraries, such as `sklearn` for model building and evaluation, `pandas` for data handling, and `numpy` for numerical operations.

Here’s a simple example using logistic regression as a model. We’ll use the `precision_recall_curve` function from sklearn to calculate the precision-recall curve and the `auc` function to calculate the area under the curve (PR AUC).

```
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, auc
# Load your dataset.
# This example assumes a CSV file with numeric feature columns and a target
# column 'is_fraud' indicating whether a transaction is fraudulent (1) or not (0).
df = pd.read_csv("your_dataset.csv")
# Split your data into features (X) and target (y)
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']
# Split your data into training and test sets; stratify=y keeps the fraud rate the same in both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Fit a logistic regression model to the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Use the model to predict the probabilities of fraud for the test set transactions
y_score = model.predict_proba(X_test)[:, 1]
# Compute the precision-recall curve
precision, recall, _ = precision_recall_curve(y_test, y_score)
# Compute PR AUC
pr_auc = auc(recall, precision)
print('PR AUC: ', pr_auc)
```

Remember to replace `your_dataset.csv` with the path to your actual data file. The `max_iter` parameter in `LogisticRegression` is set to 1000 to ensure convergence for the model. Adjust it as needed based on your specific data and problem.

This code provides a basic example of computing the PR AUC for a model. In a real-world scenario, you would likely perform additional steps such as data cleaning, feature selection, hyperparameter tuning, model validation, etc. Also, keep in mind that while logistic regression is a simple and fast model, it may not provide the best results for all problems or datasets. You may need to try different models (like decision trees, random forest, or neural networks) to get the best results.

## Limitations of PR AUC

Despite its strengths, PR AUC has limitations:

### Interpretation

PR AUC can be less intuitive to interpret than other metrics such as accuracy or F1-score. An accuracy of 95% is straightforward to read: the model correctly predicts 95 out of 100 instances on average. A PR AUC of 0.75 has no such direct reading. It says that the area under the Precision-Recall curve is 0.75, which summarizes how precision and recall trade off across all thresholds rather than describing any single operating point. The value also has to be read against a baseline: a classifier with no skill scores roughly the fraction of positives in the data, so 0.75 is excellent when 1% of samples are positive and far less impressive when 70% are. Unlike precision or recall at a given threshold, PR AUC offers a comprehensive view of the model’s ability to balance the two over the entire range of possible thresholds.
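One way to anchor a PR AUC value is to compare it with what an uninformative model would score on the same data, which is approximately the share of positives. The sketch below uses simulated labels and scores; the 5% prevalence, the sample size, and the Gaussian score model are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_samples = 20_000
prevalence = 0.05  # assumed share of positives

y_true = (rng.random(n_samples) < prevalence).astype(int)

# A no-skill model: scores carry no information about the label.
random_scores = rng.random(n_samples)
# A hypothetical informative model: positives tend to receive higher scores.
model_scores = rng.normal(size=n_samples) + 2.0 * y_true

baseline = y_true.mean()
ap_random = average_precision_score(y_true, random_scores)
ap_model = average_precision_score(y_true, model_scores)

print(f"Share of positives (baseline): {baseline:.3f}")
print(f"PR AUC, no-skill scores:       {ap_random:.3f}")
print(f"PR AUC, informative scores:    {ap_model:.3f}")
```

Reporting the PR AUC next to this baseline, or as a ratio to it, makes values comparable across datasets with different class balances.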

### Sensitivity to Class Imbalance

While PR AUC is useful for imbalanced datasets, extreme class imbalance can make it volatile and sensitive to changes in the minority class. Let’s assume a disease diagnosis model trained on a dataset with a 1% prevalence of the disease. If the model’s ability to identify these rare cases varies slightly, say between different iterations of training, the PR AUC could fluctuate dramatically, even if overall performance remains similar. This is because the Precision-Recall curve, and consequently the PR AUC, is determined by how the model ranks a small number of positive samples, so a handful of changed predictions can move the whole curve.
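A small simulation can illustrate one source of this volatility: with few positives in the evaluation set, the PR AUC estimate itself becomes noisy. The sketch below is built on assumptions rather than real data. It repeatedly draws evaluation sets of 2,000 samples from the same hypothetical score distribution at 1% and at 20% prevalence and measures how much the PR AUC (average precision) varies between draws. The helper name `pr_auc_spread` is hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def pr_auc_spread(prevalence, n_samples=2000, n_repeats=200, seed=0):
    """Mean and standard deviation of PR AUC over repeated simulated evaluation sets."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_repeats):
        y_true = (rng.random(n_samples) < prevalence).astype(int)
        # Same hypothetical model quality every time: positives score 1.5 higher on average.
        y_score = rng.normal(size=n_samples) + 1.5 * y_true
        values.append(average_precision_score(y_true, y_score))
    return float(np.mean(values)), float(np.std(values))

mean_rare, std_rare = pr_auc_spread(prevalence=0.01)
mean_common, std_common = pr_auc_spread(prevalence=0.20)

print(f" 1% positives: PR AUC mean={mean_rare:.3f}, std={std_rare:.3f}")
print(f"20% positives: PR AUC mean={mean_common:.3f}, std={std_common:.3f}")
```

The model is identical in both settings; only the number of positives per evaluation set changes. When positives are rare, differences in PR AUC between runs may therefore reflect sampling noise as much as real changes in the model, which is a reason to report confidence intervals or use larger evaluation sets.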

## The Difference Between PR AUC and ROC AUC

While both PR AUC and ROC AUC (Receiver Operating Characteristic AUC) are popular metrics used in model evaluation, they serve different purposes and should be used in different contexts. ROC AUC evaluates the trade-off between true positive rate and false positive rate, and is less sensitive to class imbalance. PR AUC, on the other hand, is more appropriate when the positive class is rare or when false positives are more important than false negatives.

| Metric | Purpose | Sensitivity to Class Imbalance | Best Used When |
| --- | --- | --- | --- |
| ROC AUC | Evaluates the trade-off between true positive rate and false positive rate | Less sensitive | The class distribution is balanced or the costs of false positives and false negatives are roughly equal |
| PR AUC | Evaluates the trade-off between precision and recall | More sensitive | The positive class is rare or false positives are more significant than false negatives |
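The difference in sensitivity can be seen by scoring the same hypothetical model at two class balances. In this sketch the data is simulated, the Gaussian score model and the helper name `evaluate` are assumptions, and PR AUC is computed with `average_precision_score`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(prevalence, n_samples=50_000, seed=0):
    """Return (ROC AUC, PR AUC) for simulated scores at the given share of positives."""
    rng = np.random.default_rng(seed)
    y_true = (rng.random(n_samples) < prevalence).astype(int)
    # Same separation between classes regardless of prevalence.
    y_score = rng.normal(size=n_samples) + 1.5 * y_true
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)

roc_balanced, pr_balanced = evaluate(prevalence=0.50)
roc_rare, pr_rare = evaluate(prevalence=0.01)

print(f"50% positives: ROC AUC={roc_balanced:.3f}, PR AUC={pr_balanced:.3f}")
print(f" 1% positives: ROC AUC={roc_rare:.3f}, PR AUC={pr_rare:.3f}")
```

ROC AUC stays roughly the same because both of its axes are rates within a single class. PR AUC drops sharply at 1% prevalence because precision depends on how many negatives compete with each positive, which is exactly the effect that matters when acting on rare-event predictions.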

## PR AUC in Model Monitoring and Evaluation

PR AUC is a valuable tool for model monitoring and evaluation. By comparing the PR AUC of different models or different versions of the same model, you can select the model that best handles the trade-off between Precision and Recall. In addition, tracking changes in the PR AUC over time can help identify model drift, when the model’s performance deteriorates due to changes in the underlying data distribution.
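Tracking PR AUC over time can be sketched with pandas. The example below assumes a prediction log with `timestamp`, `y_true`, and `y_score` columns; these column names, the helper `pr_auc_by_day`, the simulated data, and the alert tolerance of 0.1 are all hypothetical choices for illustration, and ground-truth labels are assumed to be available for each day.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

def pr_auc_by_day(log):
    """PR AUC (average precision) per calendar day; NaN when a day lacks one of the classes."""
    def day_score(group):
        if group["y_true"].nunique() < 2:
            return np.nan
        return average_precision_score(group["y_true"], group["y_score"])
    days = log["timestamp"].dt.floor("D")
    return log.groupby(days)[["y_true", "y_score"]].apply(day_score)

# Simulated prediction log: the score loses its signal on the last day.
rng = np.random.default_rng(0)
rows_per_day = 500
signal_by_day = [2.0, 2.0, 2.0, 0.0]
timestamps = pd.date_range("2024-01-01", periods=len(signal_by_day), freq="D").repeat(rows_per_day)
y_true = (rng.random(len(timestamps)) < 0.1).astype(int)
signal = np.repeat(signal_by_day, rows_per_day)
log = pd.DataFrame({
    "timestamp": timestamps,
    "y_true": y_true,
    "y_score": rng.normal(size=len(timestamps)) + signal * y_true,
})

daily_pr_auc = pr_auc_by_day(log)
# Flag days that fall more than 0.1 (an arbitrary tolerance) below the first day.
alerts = daily_pr_auc[daily_pr_auc < daily_pr_auc.iloc[0] - 0.1]

print(daily_pr_auc.round(3))
print("Days flagged:", list(alerts.index.date))
```

In practice the window size should be large enough that each window contains a reasonable number of positives; otherwise day-to-day noise in the metric can be mistaken for drift.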

## Conclusion

While Precision-Recall AUC is not a silver bullet for all machine learning problems, it is an invaluable tool when dealing with imbalanced datasets and when the cost of false positives and false negatives is significant. As with any metric, it’s crucial to understand when and how to use PR AUC to extract its maximum benefits. Stay tuned for more insights into machine learning metrics, and happy modeling!