F1 Score: A Key Metric for Evaluating Model Performance

Or Jacobi

Or is a software engineer at Aporia and an avid gaming enthusiast: "All I need is a cold brew and the controller in my hand, and I'm good to go."

5 min read Aug 18, 2023

Evaluating the performance of our ML models is an integral part of our work. A model’s performance can greatly influence its utility in real-world applications. Today, in our Production ML Academy, we’re going to talk about F1 Score, a metric that combines Precision and Recall, making it particularly useful when we have imbalanced datasets or when both false positives and false negatives are costly.

What is the F1 Score?

The F1 Score measures how well a classifier identifies the positive class on a dataset. It is most often used to evaluate binary classification systems, which classify examples into ‘positive’ or ‘negative’. F1 Score is the harmonic mean of Precision and Recall, so it summarizes both metrics in a single number between 0 and 1.

F1 Score Formula

The formula for F1 Score is:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Example with a Simple Confusion Matrix

Let’s take an example. Consider a scenario where we’re building a machine learning model to detect spam emails. Here, a ‘positive’ example is a spam email, and a ‘negative’ example is a non-spam email. In this scenario:

  • True Positives (TP): Spam emails correctly identified as spam.
  • False Positives (FP): Non-spam emails incorrectly identified as spam.
  • True Negatives (TN): Non-spam emails correctly identified as non-spam.
  • False Negatives (FN): Spam emails incorrectly identified as non-spam.

Let’s say out of 100 emails, 30 are spam. The model correctly identified 25 spam emails, but classified 5 spam emails as non-spam. Additionally, the model incorrectly classified 3 of the 70 non-spam emails as spam. Here, TP = 25, FP = 3, FN = 5, and TN = 67.

First, we calculate Precision and Recall:

Precision = TP / (TP + FP) = 25 / (25 + 3) = 0.8929 (approx)

Recall = TP / (TP + FN) = 25 / (25 + 5) = 0.8333 (approx)

Now, using the F1 Score formula:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.8929 * 0.8333) / (0.8929 + 0.8333) = 0.8621 (approx)

This tells us that the F1 Score of the model on this dataset is 0.8621, or about 86.21%.
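The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch: the function name `precision_recall_f1` is our own, and returning 0.0 when a denominator is zero is a convention we assume here, not something the formula dictates.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the spam example: TP = 25, FP = 3, FN = 5.
precision, recall, f1 = precision_recall_f1(tp=25, fp=3, fn=5)
print(f"Precision={precision:.4f} Recall={recall:.4f} F1={f1:.4f}")
# Precision=0.8929 Recall=0.8333 F1=0.8621
```

Note that True Negatives never enter the calculation, which is why the F1 Score is unaffected by how many non-spam emails are correctly left alone.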

Differentiating F1 Score from Precision and Recall

Precision is the ratio of correctly predicted positive observations to the total predicted positives. Recall is the ratio of correctly predicted positive observations to all observations that are actually positive. F1 Score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account, and it is high only when both Precision and Recall are high.

| Metric | Description | Importance and Use Cases |
|---|---|---|
| Precision | Ratio of correctly predicted positive observations to the total predicted positives | False positives are costly, e.g., email spam detection |
| Recall | Ratio of correctly predicted positive observations to all observations that are actually positive | Missing a positive is unacceptable, e.g., fraud detection |
| F1 Score | Harmonic mean of Precision and Recall; takes both false positives and false negatives into account | Balance between Precision and Recall; uneven class distribution scenarios |
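To see why it matters that F1 combines Precision and Recall with a harmonic mean instead of a plain average, here is a small sketch comparing the two. The helper name `f1_from` and the Precision/Recall pairs are made up for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Illustrative (made-up) precision/recall pairs.
for precision, recall in [(0.9, 0.9), (1.0, 0.1), (0.5, 0.0)]:
    arithmetic = (precision + recall) / 2
    f1 = f1_from(precision, recall)
    print(f"P={precision:.1f} R={recall:.1f} arithmetic={arithmetic:.2f} F1={f1:.2f}")
```

With Precision 1.0 and Recall 0.1, the plain average is 0.55, while the F1 Score is only about 0.18. The harmonic mean is pulled toward the weaker of the two metrics, so a model cannot score well by excelling at only one of them.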

It is worth noting that while these metrics provide critical insights into the performance of a classification model, the choice of metric depends largely on the specific application and the business requirements at hand. Some tasks might require high precision, while others necessitate high recall or a balance between both, which is encapsulated by the F1 Score.

F1 Score in the context of classification models

Binary classification

In binary classification, F1 Score becomes critical when the data is imbalanced. It condenses Precision and Recall into a single number, giving a more informative view of how the model handles the positive class than accuracy alone.

Multi-class classification

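In multi-class problems, F1 Score can be calculated for each class separately by considering one class as positive and the rest as negative. We can then combine these per-class F1 Scores into one number, for example with an unweighted (macro) average or with an average weighted by the number of true examples in each class.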
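The one-vs-rest calculation described above can be sketched in plain Python as follows. The function names and the toy labels are our own, and the code uses F1 = 2TP / (2TP + FP + FN), which is algebraically the same as the Precision/Recall form.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """One-vs-rest F1 for every label that appears in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no TP, FP, or FN scores 0.0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 Scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 Scores, weighted by each class's true count."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy labels, made up for illustration.
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "c", "a"]
print(per_class_f1(y_true, y_pred))
print(f"macro={macro_f1(y_true, y_pred):.4f} weighted={weighted_f1(y_true, y_pred):.4f}")
```

The macro average treats every class equally, while the weighted average lets frequent classes dominate, so the two can diverge noticeably on imbalanced data.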

Use cases for F1 Score

When F1 Score is Critical

  • Fraud Detection: In fraud detection, both false negatives (frauds that go undetected) and false positives (innocent transactions flagged as fraudulent) can be costly. Therefore, an F1 Score that encapsulates both Precision and Recall can be useful.
  • Imbalanced Datasets: In datasets where the class distribution is significantly skewed, the F1 Score can provide a better measure of performance than accuracy.
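To make the imbalanced-data point concrete, here is a small sketch with made-up numbers: 1,000 transactions, of which 10 are fraudulent. The helper names `accuracy` and `binary_f1` are our own.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class, computed as 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up): 10 fraudulent transactions followed by 990 legitimate ones.
y_true = [1] * 10 + [0] * 990

# A "model" that never flags fraud.
never_flags = [0] * 1000
# A model that catches 8 of the 10 frauds and raises 4 false alarms.
catches_most = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 986

for name, y_pred in [("never_flags", never_flags), ("catches_most", catches_most)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.3f} F1={binary_f1(y_true, y_pred):.3f}")
# never_flags: accuracy=0.990 F1=0.000
# catches_most: accuracy=0.994 F1=0.727
```

Accuracy barely distinguishes the two models, while the F1 Score makes it obvious that the first one is useless for finding fraud.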

When F1 Score may not be the Only Priority

  • In applications where the costs of false positives and false negatives are very different, you might need to focus on either Precision or Recall rather than their harmonic mean.

Practical tips for improving F1 Score

  • Class Weighting and Balancing: If your data is imbalanced, giving higher weights to the minority class can help improve your F1 Score.
  • Threshold Moving: By adjusting the threshold for classification, you can trade off Precision and Recall to maximize the F1 Score.
  • Ensemble Methods: Ensemble methods such as bagging and boosting can make the underlying classifier more accurate, which in turn can improve your model’s F1 Score.
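Of the tips above, threshold moving is the easiest to sketch in code. The example below assumes the model outputs a score per example and simply tries a grid of candidate thresholds; the function names, scores, and labels are made up for illustration. In practice you would pick the threshold on a validation set, not on the data used for the final evaluation.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class, computed as 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates):
    """Return (threshold, f1) for the candidate with the highest F1.

    Ties keep the first (lowest) candidate encountered.
    """
    best = (None, -1.0)
    for threshold in candidates:
        # Predict positive when the score reaches the threshold.
        y_pred = [int(score >= threshold) for score in scores]
        f1 = binary_f1(y_true, y_pred)
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Toy validation data (made up): true labels and model scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.12, 0.18, 0.22, 0.31, 0.36, 0.42, 0.47, 0.63, 0.91]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

threshold, f1 = best_threshold(y_true, scores, candidates)
print(f"best threshold={threshold} F1={f1:.3f}")
```

On this toy data the default threshold of 0.5 misses one positive, while lowering it to 0.4 recovers that positive at the cost of one false alarm and yields a higher F1 Score.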

The role of F1 Score in model evaluation and monitoring

The F1 Score is not just useful during model development, but also when the model is in production. Continuous monitoring of the F1 Score can help you ensure that your model continues to perform well and catch potential issues before they become problematic.
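As a rough illustration of what such monitoring could look like, the sketch below computes the F1 Score per batch of production predictions and flags batches that fall below a chosen level. It assumes ground-truth labels eventually become available for each batch; the function `f1_alerts`, the `alert_below` parameter, and the weekly batches are all hypothetical.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class, computed as 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_alerts(batches, alert_below=0.8):
    """Return (batch_name, f1, is_alert) for each labelled batch.

    `alert_below` is a hypothetical alerting level; choose it per use case.
    """
    results = []
    for name, y_true, y_pred in batches:
        f1 = binary_f1(y_true, y_pred)
        results.append((name, f1, f1 < alert_below))
    return results


# Toy batches (made up): (name, true labels, model predictions).
batches = [
    ("week_1", [1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1]),
    ("week_2", [1, 1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0, 1, 0]),
]

for name, f1, is_alert in f1_alerts(batches):
    status = "ALERT" if is_alert else "ok"
    print(f"{name}: F1={f1:.3f} {status}")
```

Here the second batch drops well below the alert level, which is the kind of signal that would prompt a closer look at the data or the model.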

Limitations and Cautions of F1 Score

  • Risks of over-optimizing F1 Score: Optimizing your model to maximize the F1 Score might result in a model that performs poorly on other important business metrics. It’s important to consider all relevant metrics when tuning your model.
  • Not always the best metric: While the F1 Score can be a useful metric, it’s not the best metric for every problem. For example, in some problems, you might care more about Precision or Recall individually than about their harmonic mean.

Wrapping up, the F1 Score provides a balanced measure of Precision and Recall, and is a crucial tool in every ML engineer’s toolkit. However, as with all tools, it’s important to understand when and how to use it. By considering the unique requirements of your problem and using the F1 Score judiciously, you can develop ML models that perform well in practice.
