In the previous blog, Confusion Matrix 101: Understanding Precision and Recall for Machine Learning Beginners, we covered precision and recall, two important metrics for evaluating a classifier model. Precision and recall share an inverse relationship: if a model is optimized for higher precision, its recall will decrease, and vice versa. Depending on the type of problem statement, it is necessary to find a compromise between these values.
In this blog, we will understand the trade-off between precision and recall. Moreover, we will learn about the ROC curve and its popular summary metric, the ROC AUC score. The popular machine learning library sklearn provides some handy functions that help us make the appropriate decision and find that sweet spot between precision and recall.
Tuning the Decision Threshold
Whether a classifier model labels a sample as true or false depends on the defined decision threshold. For instance, suppose that for some problem the decision threshold is set at a score of 0.5. If the model's score for a sample is greater than the threshold, the sample is classified as true; otherwise, it is classified as false.
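As a minimal sketch (the score values below are made up purely for illustration), the decision rule is just a comparison of the model's scores against the threshold:
import numpy as np

# Hypothetical raw scores, e.g. from a classifier's decision_function()
sample_scores = np.array([-1.2, 0.1, 0.48, 0.55, 2.3])
threshold = 0.5

# Samples whose score clears the threshold are classified as positive (True)
predictions = sample_scores >= threshold
print(predictions)  # [False False False  True  True]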
The threshold value significantly influences the precision and recall of the model. There are two scenarios to consider here:
Lowering the Threshold
If we lower the threshold, say from 0.5 to 0.3, the classifier model essentially gets more “lenient”. Even if an instance has a lower probability of being positive, it will still be considered positive. This can lead to:
- Higher Recall: You’re casting a wider net, so you’ll catch more true positives. Hooray for more spam caught!
- Lower Precision: However, this leniency means you’re also likely to catch more false positives. Not so great if non-spam emails start getting flagged.
Increasing the Threshold
Conversely, if we raise the threshold to, say, 0.7, the machine learning model gets stricter. Only instances with a high probability of being positive will be classified as such. This results in:
- Higher Precision: Our stricter criteria mean that most of the positive predictions will be correct. Non-spam emails are safe!
- Lower Recall: But, some true positives are likely to be missed because the window is now tighter. Some spam might slip through.
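To see this trade-off in numbers, here is a small hypothetical example (the probabilities and labels are made up purely for illustration) that computes precision and recall at a lenient threshold and at a strict one:
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up predicted probabilities of an email being spam, and the true labels
probabilities = np.array([0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9])
true_labels = np.array([0, 1, 0, 1, 0, 1, 1])

for threshold in (0.3, 0.7):
    predictions = (probabilities >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(true_labels, predictions):.2f}, "
          f"recall={recall_score(true_labels, predictions):.2f}")

# threshold=0.3: precision=0.67, recall=1.00
# threshold=0.7: precision=1.00, recall=0.50
The lenient threshold catches every spam email but flags more legitimate ones, while the strict threshold does the opposite.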
So what is the solution? Randomly try out different threshold values and see where the model performs best, right? Yeah, but that doesn't sound practical and seems like a lot of boring work. Fortunately, scikit-learn will help us with this problem.
Precision-Recall-Threshold Curve
Recall the previous post, where we built a binary image classifier that distinguishes the digit "1" from every other digit. To get an idea of the score values, let us first check the decision scores of a few samples.
for i in range(5):
    score = classifier_model.decision_function([strat_train_set.iloc[i]])
    pred = classifier_model.predict([strat_train_set.iloc[i]])
    print(score, pred)
'''
[Output]:
[-46809.95805696] [False]
[6553.70585533] [ True]
[-13187.68186442] [False]
[4910.45286989] [ True]
[-42549.63111972] [False]
'''
As seen above, the scores of these 5 samples range from about -47,000 to 6,600, and scores on other samples may fall outside this range. Plotting precision and recall against different threshold values helps in selecting an appropriate threshold. To do this, we first obtain scores for all samples using the cross_val_predict() function, with decision_function as the method used to compute the score.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Cross-validated decision scores for every training sample
scores = cross_val_predict(classifier_model, strat_train_set, strat_train_labels, cv=3, method="decision_function")

# Precision and recall for every candidate threshold
precisions, recall, thresholds = precision_recall_curve(strat_train_labels, scores)

plt.plot(thresholds, precisions[:-1], "r-", label="Precision")
plt.plot(thresholds, recall[:-1], "b-", label="Recall")
plt.grid()
plt.legend(loc="center left")
plt.xlabel("Threshold", fontweight="bold")
plt.xlim([-60000, 20000])
plt.show()

Let us first understand the precision_recall_curve() function of sklearn. It computes precision-recall pairs at different decision thresholds in binary classification tasks. Let's break down its inputs and outputs:
- Inputs:
  - strat_train_labels: the true binary labels for the training dataset. This is an array where each element is either 0 (negative class) or 1 (positive class).
  - scores: the predicted scores for each instance in the training dataset. These are usually probabilities or confidence scores output by a classifier.
- Outputs:
  - precisions: an array of precision values calculated at different thresholds. Precision is the ratio of true positive predictions to the total number of positive predictions (true positives + false positives).
  - recall: an array of recall values calculated at different thresholds. Recall (or sensitivity) is the ratio of true positive predictions to the total number of actual positives (true positives + false negatives).
  - thresholds: an array of thresholds used to compute the corresponding precision and recall values. These thresholds are values taken from the scores array, at which precision and recall are evaluated.
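One detail worth knowing: precision_recall_curve() returns one more precision and recall value than there are thresholds, because the arrays are padded with a final precision of 1 and recall of 0. That is why the plotting code above drops the last element with [:-1]. A quick sanity check:
# precisions and recall have one more element than thresholds,
# and the arrays end with precision = 1.0 and recall = 0.0
print(len(precisions), len(recall), len(thresholds))
print(precisions[-1], recall[-1])  # 1.0 0.0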
Using the thresholds it generates, the function computes the corresponding precision and recall values, which we then plot using matplotlib. From the plot, we can see that for thresholds between roughly 0 and 10,000, recall dips sharply from about 0.95 to almost 0, while precision climbs to a perfect 1. If the priority of the machine learning problem is better recall, a good threshold lies at around -5,000, where recall is above 95% and precision still sits above 90%. Let us find and test such a threshold in our example.
# Find the first threshold at which recall is at least 0.95
# and precision is at least 0.9
preferred_threshold = 0
for i in range(len(thresholds)):
    if recall[i] >= 0.95 and precisions[i] >= 0.9:
        preferred_threshold = thresholds[i]
        break
print(preferred_threshold)
# -1790.67257359577
We now have our preferred threshold. Let us apply it and monitor the model's performance.
from sklearn.metrics import precision_score, recall_score

y_scores = cross_val_predict(classifier_model, strat_train_set, strat_train_labels, cv=3,
                             method="decision_function")
# Classify samples as positive whenever their score clears the chosen threshold
new_scores = (y_scores >= preferred_threshold)
y_precision = precision_score(strat_train_labels, new_scores)
y_recall = recall_score(strat_train_labels, new_scores)
# Precision = 0.8318273769386378
# Recall = 0.9787369089178038
Almost 98% recall! The classifier catches almost 98% of the actual positive samples. However, what about false positives? Looking at the precision, when the model classifies an image as positive, it is correct only about 83% of the time. Though the performance might seem acceptable, there is still room for improvement.
Remember what precision and recall mean: precision tells us, among all the samples the model classified as positive, how many are actually positive, while recall tells us, of all the actual positives, how many the model managed to identify. However, neither provides any information about the false positive rate, that is, the ratio of false positives to actual negatives. This information is useful when the number of samples in the negative class is small compared to the positive class, unlike in this example, where the negative class heavily outnumbers the positive class. This is where the ROC curve comes in.
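Before moving on to the ROC curve, here is a short sketch of how the false positive rate can be computed directly from the confusion matrix, reusing the thresholded predictions (new_scores) from above:
from sklearn.metrics import confusion_matrix

# sklearn's confusion matrix layout for binary labels:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(strat_train_labels, new_scores).ravel()

false_positive_rate = fp / (fp + tn)  # fraction of actual negatives flagged as positive
true_positive_rate = tp / (tp + fn)   # identical to recall
print(false_positive_rate, true_positive_rate)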
The ROC Curve
Like the precision-recall curve, the ROC (Receiver Operating Characteristic) curve is a tool that is widely used in binary classification problems. While the former shows precision and recall across different threshold values, the ROC curve shows how the true positive rate (another term for recall) changes with respect to the false positive rate (the ratio of false positives to actual negatives). An ideal classifier would have a 100% true positive rate and a 0% false positive rate. Let us check the ROC curve of our binary classifier:
from sklearn.metrics import roc_curve

false_pos_rate, true_pos_rate, thresholds = roc_curve(strat_train_labels, scores)

plt.plot(false_pos_rate, true_pos_rate, linewidth=2)
plt.plot([0, 1], [0, 1], "k--")  # dashed diagonal = purely random classifier
plt.xlabel("False Positive Rate", fontweight="bold")
plt.ylabel("True Positive Rate", fontweight="bold")
plt.show()

The plot tells us that the model performs well: it keeps the false positive rate low as the true positive rate increases, but only up to a certain point. Beyond that, the false positive rate increases drastically for only a slight gain in the true positive rate. Yet again, we need to pick a point of compromise.
One more thing to note here is the dashed diagonal that runs across the plot. It represents the performance of a purely random classifier. We want our model's curve to stay as far away from that line as possible, towards the top-left corner.
When we work with more than one machine learning model, a good approach is to compare their performance using ROC curves. In that case, the model whose ROC curve covers more area under it is considered the better model.
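As a rough sketch of such a comparison (forest_model and forest_scores are names introduced here for illustration; since RandomForestClassifier has no decision_function(), we use the cross-validated probability of the positive class as its score, and we reuse false_pos_rate and true_pos_rate from the plot above):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve

forest_model = RandomForestClassifier(random_state=42)
forest_probas = cross_val_predict(forest_model, strat_train_set, strat_train_labels,
                                  cv=3, method="predict_proba")
forest_scores = forest_probas[:, 1]  # probability of the positive class

fpr_forest, tpr_forest, _ = roc_curve(strat_train_labels, forest_scores)

plt.plot(false_pos_rate, true_pos_rate, "b-", label="classifier_model")
plt.plot(fpr_forest, tpr_forest, "g-", label="RandomForestClassifier")
plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("False Positive Rate", fontweight="bold")
plt.ylabel("True Positive Rate", fontweight="bold")
plt.legend(loc="lower right")
plt.show()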
The ROC AUC Score
AUC here stands for Area Under the Curve. To calculate it, we use the roc_auc_score() function of sklearn:
from sklearn.metrics import roc_auc_score
roc_auc_score(strat_train_labels, scores)
# 0.9957041674206655
Now we have a single number that summarizes how well our model avoids wrongly classifying negative samples as positive. This value can be compared against other classifiers, such as a RandomForestClassifier.
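For instance, a quick comparison, reusing the forest_scores computed in the ROC sketch above:
from sklearn.metrics import roc_auc_score

# AUC of the random forest, based on its cross-validated probabilities
roc_auc_score(strat_train_labels, forest_scores)
Whichever model yields the higher AUC handles the trade-off between true and false positives better overall.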
Summary
In this blog on the ROC curve, we learned how to analyze the performance of a classifier model in machine learning using the scikit-learn library in Python. The performance metrics in this case, namely precision and recall, can be tuned by selecting an appropriate threshold value. When working with data where the negatively labeled samples are relatively few compared to the positive ones, we turn to an additional metric, the ROC curve. Moreover, the area under the ROC curve can be used to compare different machine learning models and choose the one with the best performance.
This leads to the question: what if our data is updated with new samples that are negatively labeled? Do we have to repeat this process of splitting the data, plotting the curve, and determining the ROC AUC value, or is there a simpler way of building a pipeline once and running the data through it every time it is updated…
If you enjoyed this blog, which you did if you are still here reading this, join me on social media and follow me as a sign of support. It will cost you just a couple of seconds but will give me the motivation to keep creating interesting content on machine learning and programming: