Evaluating recommendation systems (ROC, AUC, and Precision-Recall)

You have probably heard of terms like ROC, AUC, and Precision-Recall: they show up in data science articles on Medium, in machine learning tutorials, and academic papers are full of them. But why are they so important, and what do they actually mean? Today we will dive into the specifics of these essential metrics, explain how they work, and show why they matter in the world of machine learning and recommendation systems.

March 28, 2023 • 10 min read • by Heorhii Skovorodnikov

Bubble tea🧋 and RecSys metrics

The first thing you need to know is that all of these metrics provide a quantifiable measure of performance. Imagine that you are a machine learning developer building a bubble tea delivery app. These days there are many vendors offering a wide variety of options, but you want to make sure your users buy their tea through your app. To make that happen, you want to suggest the best bubble tea options based on each user's preferred flavors, tea type, sugar level, and delivery time (gotta keep it fresh). All of this can be framed as a typical recommendation problem.

For simplicity, let's assume that you are just starting out, so you offer only two types of bubble tea: one with tapioca balls and another with coconut jelly. After tracking and recording user orders, you build a dataset of customers who prefer each type. Next, you want to classify them, so that your recommendation model can infer from a user's data that they would prefer bubble tea chains and products that contain tapioca. This type of problem is known as binary classification; in simple RecSys terms, the two things you are trying to predict are a user's likes and dislikes.
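To make the setup concrete, here is a minimal sketch of how such a dataset could be encoded for a binary classifier. The column names, values, and labels are made up purely for illustration; any tabular encoding with a 0/1 target works the same way.

```python
import pandas as pd

# Hypothetical toy dataset: 1 = prefers tapioca, 0 = prefers coconut jelly.
orders = pd.DataFrame({
    "sugar_level":   [0.25, 0.50, 1.00, 0.75, 0.00],
    "tea_type":      ["black", "oolong", "black", "green", "oolong"],
    "likes_tapioca": [1, 0, 1, 1, 0],   # the binary target we want to predict
})

X = pd.get_dummies(orders.drop(columns="likes_tapioca"))  # simple feature encoding
y = orders["likes_tapioca"]
print(X.shape, y.tolist())
```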

In the real world this task gets more complex, but these evaluation metrics work the same way across a multitude of options and scenarios. So let's take our example and explain all the metrics with it.

ROC - Receiver Operating Characteristic

We have trained our model and now we need to test it out, so where do we start? We begin with ROC, which stands for Receiver Operating Characteristic. It is a graphical representation of the performance of a binary classification model: it plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings.

To break it down let’s explain those terms:

TPR is the percentage of correctly predicted positive examples out of all the actual positive examples. Here, it measures how often our model correctly predicts that a user likes the type of bubble tea we selected.
As an equation: TPR = TP / P = TP / (TP + FN), where TP is the number of true positives and P is the number of all actual positives.

Now what about FPR?

FPR is the percentage of incorrectly predicted positive examples out of all the actual negative examples. In our case, this is how often the model predicts that a customer likes a type of bubble tea they actually don't.
As an equation: FPR = FP / N = FP / (FP + TN), where FP is the number of false positives and N is the number of all actual negatives.

As we can see, TPR and FPR are built from the same confusion-matrix counts (TP, FP, TN, FN), and both change together as the classification threshold moves.
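Here is a short sketch of how both rates can be computed from a confusion matrix, using scikit-learn for illustration. The labels and predictions are toy values, not output from a real model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (1 = likes tapioca, 0 = likes coconut jelly).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels {0, 1}.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # true positive rate  = TP / P
fpr = fp / (fp + tn)  # false positive rate = FP / N
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```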

A perfect model would have a ROC curve that hugs the top-left corner of the plot, meaning it achieves a high TPR and a low FPR at all threshold settings. A model that makes random predictions would have a ROC curve that is a diagonal line from the bottom-left to the top-right corner, meaning its TPR and FPR are equal at every threshold. The threshold here is the score above which a prediction is counted as positive (or negative, depending on your setup). For example, a threshold of 0.5 means that every sample scoring above 50% for the target class is assigned to that class.
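A quick sketch of what thresholding looks like in practice; the predicted probabilities below are invented for illustration:

```python
import numpy as np

# Predicted probabilities for the positive class ("likes tapioca"), made up for illustration.
proba = np.array([0.92, 0.40, 0.65, 0.51, 0.08])

for threshold in (0.3, 0.5, 0.7):
    labels = (proba >= threshold).astype(int)  # 1 if the score clears the threshold
    print(threshold, labels)
# Raising the threshold makes the model more conservative: fewer predicted positives,
# which trades some true positives for fewer false positives.
```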

Threshold settings adjust the balance between true positives and false positives by changing the criterion that determines when an example is classified as positive. Sweeping the threshold and recording the resulting (FPR, TPR) pairs is exactly what traces out the ROC curve.

The figure shows the ROC curve together with a dashed diagonal line labeled "random", which represents the ROC curve of a classifier that makes random predictions.
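If you want to reproduce a plot like this yourself, here is a minimal sketch using scikit-learn and matplotlib. The labels and scores are synthetic stand-ins for a real model's output.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Synthetic labels and scores: positives tend to score higher than negatives.
y_true = rng.integers(0, 2, size=200)
scores = np.where(y_true == 1,
                  rng.normal(0.65, 0.2, size=200),
                  rng.normal(0.35, 0.2, size=200))

fpr, tpr, thresholds = roc_curve(y_true, scores)

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random")  # diagonal reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```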

AUC - Area Under the Curve

ROC seems sufficient, so why use AUC? AUC, or Area Under the Curve, summarizes the ROC curve across all thresholds into a single number: it is literally the area under the ROC curve.

AUC is calculated by integrating the ROC curve, which, as described above, plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings. This collapses all of those operating points into a single score.

A perfect model would have an AUC of 1, meaning that all positive examples are ranked higher than all negative examples, while a model that makes random predictions would have an AUC of 0.5. This detail matters: because AUC measures ranking quality across all thresholds, it is often more informative than a plain accuracy score, which is just the percentage of correct predictions at a single threshold and can look deceptively good, especially on imbalanced data.
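To see why accuracy alone can be misleading, here is a hedged toy example: a "lazy" model that gives every user the same score gets 90% accuracy on an imbalanced dataset, yet its AUC correctly reveals that its ranking is no better than random. The data is synthetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced toy labels: 90 users dislike, 10 like the recommended tea.
y_true = np.array([0] * 90 + [1] * 10)

# A "lazy" model that gives every user the same low score, i.e. predicts "dislike" for all.
scores = np.full(100, 0.1)
y_pred = (scores >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))   # 0.90 -- looks great
print("ROC AUC :", roc_auc_score(y_true, scores))    # 0.5  -- no better than random ranking
```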

AUC can be interpreted as the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. Below we can see what AUC looks like when plotted: it is the shaded blue area under the ROC curve. If we were to shade everything below the random line instead, we would get an AUC of 0.5, the signature of a classifier making random predictions.
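That probabilistic interpretation can be checked directly: the fraction of (positive, negative) pairs where the positive example scores higher equals the AUC. The sketch below uses synthetic scores and compares the pairwise estimate with scikit-learn's roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=300)
scores = rng.normal(loc=y_true * 0.5, scale=0.4)  # positives tend to score higher

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs where the positive is ranked higher
# (ties count as half) -- exactly the AUC's probabilistic interpretation.
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print("pairwise estimate:", round(float(pairwise), 4))
print("roc_auc_score    :", round(float(roc_auc_score(y_true, scores)), 4))  # should match
```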

Recalling it all with Precision-Recall

We can clearly see now that all of these metrics are related, because they all rely on the same basic counts: TP, TN, FP, and FN. So what is different about precision-recall?

In essence, the task PR fulfills is very similar, but with a slight difference. ROC and AUC measure the classifier's ability to separate positive and negative examples and are a good choice when the dataset is balanced (an equal number of both classes). PR instead measures a model's ability to identify positive samples while minimizing false positives, which makes it a better choice for imbalanced datasets and a good option when you care mostly about the positive class.
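As a sketch of how this looks in code, scikit-learn's precision_recall_curve and average_precision_score summarize the positive-class behavior directly. The heavily imbalanced data below is synthetic, standing in for a scenario where only a few users like the recommended item.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

rng = np.random.default_rng(2)

# Heavily imbalanced toy data: only ~5% of users are positives.
y_true = (rng.random(1000) < 0.05).astype(int)
scores = rng.normal(loc=y_true * 0.8, scale=0.5)  # positives tend to score higher

precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)  # single-number summary of the PR curve

print("points on PR curve:", len(precision))
print("average precision :", round(float(ap), 3))
```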


To better understand this, let's recall the equations for TPR and FPR above and compare them with the equations for precision and recall:

Precision = TP / (TP + FP)

And for recall:

Recall = TP / (TP + FN)

Notice anything in common? That's right: TPR and recall are the same. There is one more important detail: true negatives are missing from these equations entirely. As mentioned, PR focuses on the positive examples, because under this metric they are the ones we care about most.
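A quick numeric check, reusing the same toy predictions from the TPR/FPR sketch above, confirms that recall comes out identical to the TPR we computed there:

```python
from sklearn.metrics import precision_score, recall_score

# Same toy predictions as before (1 = likes tapioca, 0 = likes coconut jelly).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4, identical to TPR

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```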

A perfect model would have a precision of 1 and a recall of 1, meaning that every example predicted as positive really is positive, and every actual positive example is predicted as positive. In practice there is a trade-off between precision and recall: a model with high precision might have low recall and vice versa. PR helps us understand how well the model identifies relevant examples while minimizing the number of irrelevant ones.
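The trade-off is easy to see by sweeping the classification threshold. The scores and labels below are invented for illustration only:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic scores and labels, made up for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.95, 0.80, 0.60, 0.40, 0.55, 0.35, 0.30, 0.20, 0.10, 0.05])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# A lower threshold recovers more positives (higher recall) at the cost of
# more false positives (lower precision), and vice versa.
```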

This sums up our journey into popular ML metrics!
