
24 Evaluation Metrics for Binary Classification (And When to Use Them)

16 min
5th September, 2023

Classification metrics let you assess the performance of machine learning models. But there are many of them, each with its own benefits and drawbacks, and selecting an evaluation metric that works for your problem can be tricky.

In this article, you will learn about a bunch of common and lesser-known evaluation metrics and charts to understand how to choose the model performance metric for your problem.

Specifically, I will talk about:

  • The definition and intuition behind most major classification metrics,
  • A non-technical explanation of binary classification metrics that you can communicate to business stakeholders,
  • How to plot performance charts and calculate common metrics for binary classification,
  • When you should use each of them.

With that, you will understand the trade-offs so that making metric-related decisions will be easier. 

What exactly are classification metrics?

Simply put, a classification metric is a number that measures how well your machine learning model assigns observations to classes.

Binary classification is the particular case where you have just two classes: positive and negative.

Typically the performance is presented on a range from 0 to 1 (though not always) where a score of 1 is reserved for the perfect model. 

Not to bore you with dry definitions, let's discuss various classification metrics using an example fraud-detection problem based on a recent Kaggle competition.

I selected 43 features and sampled 66000 observations from the original dataset adjusting the fraction of positive class to 0.09.

Then I trained a bunch of lightGBM classifiers with different hyperparameters. I only varied the learning_rate and n_estimators parameters because I wanted to have an intuition as to which models are "truly" better. Specifically, I suspect that a model with only 10 trees is worse than a model with 100 trees. Of course, as we use more trees and smaller learning rates it gets tricky, but I think it is a decent proxy.

So for combinations of learning_rate and n_estimators, I did the following:

  • defined hyperparameter values:
MODEL_PARAMS = {'random_state': 1234,
                'learning_rate': 0.1,
                'n_estimators': 10}
  • trained the model:
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
  • predicted on test data:
y_test_pred = model.predict_proba(X_test)
  • logged scores for each run:
run["logs/score"] = score

To know more about logging scores and metrics visit Neptune docs.

  • logged matplotlib figures for each run:
run["images/figure"].upload(neptune.types.File.as_image(fig))

To know more about logging matplotlib figures visit Neptune docs.

Here, you can explore experiment runs with:

  • evaluation metrics
  • performance charts
  • metric by threshold plots
Runs table in Neptune

Ok, now we are ready to talk about those classification metrics!

Learn about the following evaluation metrics

I know it is a lot to go over at once. That is why you can jump to the section that is interesting to you and read just that.

1. Confusion Matrix

How to compute:

It is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.

It is calculated on class predictions, which means the outputs from your model need to be thresholded first.

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
cm = confusion_matrix(y_true, y_pred_class)
tn, fp, fn, tp = cm.ravel()
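To make the thresholding step concrete, here is a minimal pure-Python sketch of what happens before the confusion matrix is built. It assumes that y_pred_pos holds the positive-class scores (for example, the second column of the predict_proba output); the helper name confusion_counts and the toy data are made up for illustration and are not part of the original experiments.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding positive-class scores."""
    tn = fp = fn = tp = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score > threshold  # same rule as y_pred_pos > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy data, made up for illustration (not taken from the fraud dataset).
toy_true = [0, 0, 0, 0, 1, 1, 0, 1]
toy_score = [0.05, 0.20, 0.40, 0.60, 0.30, 0.70, 0.10, 0.90]

counts_default = confusion_counts(toy_true, toy_score, threshold=0.5)
counts_strict = confusion_counts(toy_true, toy_score, threshold=0.8)

print("threshold 0.5:", counts_default)
print("threshold 0.8:", counts_strict)
```

Raising the threshold in this toy example removes the false positive but turns one true positive into a false negative, which is the trade-off that every thresholded metric in this article reacts to.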

How does it look:

plot confusion matrix

So in this example, we can see that:

  • 11918 predictions were true negatives,
  • 872 were true positives,
  • 82 were false positives,
  • 333 predictions were false negatives.
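The four counts above are enough to derive every rate discussed in the following sections. As a worked example, the sketch below plugs them into the same formulas that the later code snippets use; the dictionary name rates is just for illustration.

```python
# Counts reported for the confusion matrix above.
tn, fp, fn, tp = 11918, 82, 333, 872

rates = {
    "false_positive_rate": fp / (fp + tn),
    "false_negative_rate": fn / (tp + fn),
    "true_negative_rate": tn / (tn + fp),
    "negative_predictive_value": tn / (tn + fn),
    "false_discovery_rate": fp / (tp + fp),
    "recall": tp / (tp + fn),
    "precision": tp / (tp + fp),
}

for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Note that some of these come in complementary pairs: false positive rate and true negative rate sum to 1, as do false negative rate and recall, and false discovery rate and precision.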

Also, as we already know, this is an imbalanced problem. By the way, if you want to read more about imbalanced problems I recommend taking a look at this article by Tom Fawcett.

When to use it:

  • Pretty much always. I like to see the nominal values rather than normalized ones to get a feeling for how the model is doing on different, often imbalanced, classes.

2. False Positive Rate | Type I error

When we predict something that isn't there (for example, flag a clean transaction as fraudulent), we are contributing to the false positive rate. You can think of it as the fraction of clean transactions that raise a false alert based on your model predictions.

false positive rate

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)

# log score to neptune
run["logs/false_positive_rate"] = false_positive_rate

How models score in this metric (threshold=0.5):

false positive rate comparison

For all the models type-1 error alerts are pretty low but by adjusting the threshold we can get an even lower ratio. Since we have true negatives in the denominator, our error will tend to be low just because the dataset is imbalanced.

How does it depend on the threshold:

false positive rate by threshold

Obviously, if we increase the threshold, only higher-scored observations will be classified as positive. In our example, we can see that to reach a perfect FPR of 0 we need to increase the threshold to 0.83. However, that will likely mean that only very few observations are classified as positive.

When to use it:

  • You would rarely use this metric alone; it usually serves as an auxiliary metric alongside some other one,
  • If the cost of dealing with an alert is high you should consider increasing the threshold to get fewer alerts.

3. False Negative Rate | Type II error

When we fail to predict something that is there (for example, miss a fraudulent transaction), we are contributing to the false negative rate. You can think of it as the fraction of fraudulent transactions that your model lets through.

false negative rate

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)

# log score to neptune
run["logs/false_negative_rate"] = false_negative_rate

How models score in this metric (threshold=0.5):

We can see that in our example, type-2 errors are quite a bit higher than type-1 errors. Interestingly, our BIN-98 experiment that had the lowest type-1 error has the highest type-2 error. There is a simple explanation based on the fact that our dataset is imbalanced and with type-2 error we don't have true negatives in the denominator.

How does it depend on the threshold:

false negative rate by threshold

If we decrease the threshold, more observations will be classified as positive. At certain threshold, we will mark everything as positive (fraudulent for example). We can actually get to the FNR of 0.083 by decreasing the threshold to 0.01.

When to use it:

  • Usually, it is not used alone but rather with some other metric,
  • If the cost of letting the fraudulent transactions through is high and the value you get from the users isn't, you can consider focusing on this number.

4. True Negative Rate | Specificity

It measures how many observations out of all negative observations we have classified as negative. In our fraud detection example, it tells us how many transactions, out of all non-fraudulent transactions, we marked as clean.

true negative rate

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)

# log score to neptune
run["logs/true_negative_rate"] = true_negative_rate

How models score in this metric (threshold=0.5):

Very high specificity for all the models. If you think about it, in our imbalanced problem you would expect that. Classifying negative cases as negative is a lot easier than classifying positive cases and hence the score is high.

How does it depend on the threshold:

true negative rate by threshold

The higher the threshold, the more truly negative observations we can recall. We can see that starting from, say, threshold=0.4 our model is doing really well in classifying negative cases as negative.

When to use it:

  • Usually, you don't use it alone but rather as an auxiliary metric,
  • When you really want to be sure that you are right when you say something is safe. A typical example would be a doctor telling a patient "you are healthy". Making a mistake here and telling a sick person they are safe and can go home is something you may want to avoid.

5. Negative Predictive Value

It measures how many predictions out of all negative predictions were correct. You can think of it as precision for negative class. With our example, it tells us what is the fraction of correctly predicted clean transactions in all non-fraudulent predictions.

negative predictive value

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
negative_predictive_value = tn / (tn + fn)

# log score to neptune
run["logs/negative_predictive_value"] = negative_predictive_value

How models score in this metric (threshold=0.5):

All models score really high and no wonder, since with an imbalanced problem it is easy to predict negative class.

How does it depend on the threshold:

negative predictive value by threshold

The higher the threshold the more cases are classified as negative and the score goes down. However, in our imbalanced example even at a very high threshold, the negative predictive value is still good.

When to use it:

  • When we care about high precision on negative predictions. For example, imagine we really don't want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.

6. False Discovery Rate

It measures how many predictions out of all positive predictions were incorrect. You can think of it as simply 1-precision. With our example, it tells us what is the fraction of incorrectly predicted fraudulent transactions in all fraudulent predictions.

false discovery rate

How to compute:

from sklearn.metrics import confusion_matrix

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_discovery_rate = fp / (tp + fp)

# log score to neptune
run["logs/false_discovery_rate"] = false_discovery_rate

How models score in this metric (threshold=0.5):

The "best model" is an incredibly shallow lightGBM, which we expect to be incorrect (a deeper model should work better).

That is an important takeaway: looking at precision (or recall) alone can lead to you selecting a suboptimal model.

How does it depend on the threshold:

false discovery rate by threshold

The higher the threshold, the fewer positive predictions there are, and the ones that remain have higher scores. Hence, the false discovery rate goes down.

When to use it:

  • Again, it usually doesn't make sense to use it alone but rather coupled with other metrics like recall.
  • When raising false alerts is costly and when you want all the positive predictions to be worth looking at, you should optimize for precision, which is the same as minimizing the false discovery rate.

7. True Positive Rate | Recall | Sensitivity

It measures how many observations out of all positive observations we have classified as positive. It tells us how many fraudulent transactions we recalled from all fraudulent transactions.

true positive rate

When you are optimizing recall you want to put all guilty in prison.

How to compute:

from sklearn.metrics import confusion_matrix, recall_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
recall = recall_score(y_true, y_pred_class) # or optionally tp / (tp + fn)

# log score to neptune
run["logs/recall_score"] = recall

How models score in this metric (threshold=0.5):

Our best model can recall 0.72 of fraudulent transactions at the threshold 0.5. The difference in recall between our models is quite significant and we can clearly see better and worse models. Of course, for every model, we can adjust the threshold to recall all fraudulent transactions.

How does it depend on the threshold:

true positive rate by threshold

For the threshold of 0.1, we classify the vast majority of transactions as fraudulent and hence get really high recall of 0.917. As the threshold increases the recall falls.

When to use it:

  • Usually, you will not use it alone but rather coupled with other metrics like precision.
  • That being said, recall is a go-to metric, when you really care about catching all fraudulent transactions even at a cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when the transaction goes unseen.

8. Positive Predictive Value | Precision

It measures how many observations predicted as positive are in fact positive. Taking our fraud detection example, it tells us what is the ratio of transactions correctly classified as fraudulent.

positive predictive value

When you are optimizing precision you want to make sure that people that you put in prison are guilty.

How to compute:

from sklearn.metrics import confusion_matrix, precision_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
precision = precision_score(y_true, y_pred_class) # or optionally tp / (tp + fp)

# log score to neptune
run["logs/precision_score"] = precision

How models score in this metric (threshold=0.5):

It seems like all the models have pretty high precision at this threshold. The "best model" is an incredibly shallow lightGBM, which obviously smells fishy. That is an important takeaway: looking at precision (or recall) alone can lead to you selecting a suboptimal model. Of course, for every model, we can adjust the threshold to increase precision. That is because if we take a small fraction of high-scoring predictions, the precision on those will likely be high.

How does it depend on the threshold:

positive predictive value by threshold

The higher the threshold the better the precision, and with a threshold of 0.68 we can actually get a perfectly precise model. Over this threshold, the model doesn't classify anything as positive and so we don't plot it.

When to use it:

  • Again, it usually doesn't make sense to use it alone but rather coupled with other metrics like recall.
  • When raising false alerts is costly and you want all the positive predictions to be worth looking at, you should optimize for precision.

9. Accuracy

It measures how many observations, both positive and negative, were correctly classified.

accuracy

You shouldn't use accuracy on imbalanced problems, because it is easy to get a high accuracy score by simply classifying all observations as the majority class. For example, in our case, by classifying all transactions as non-fraudulent we can get an accuracy of over 0.9.
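As a quick sanity check of that claim, the arithmetic below reuses the counts from the confusion matrix in section 1. It is only a sketch: the "dummy model" is simulated by treating every observation as predicted negative, so all actual negatives become correct and all actual positives become misses.

```python
# Counts from the confusion matrix shown in section 1.
tn, fp, fn, tp = 11918, 82, 333, 872
total = tn + fp + fn + tp

# Accuracy of the actual model at threshold 0.5.
model_accuracy = (tp + tn) / total

# A dummy model that predicts "non-fraudulent" for everything is correct
# on every actual negative (tn + fp) and wrong on every actual positive.
dummy_accuracy = (tn + fp) / total
dummy_recall = 0 / (tp + fn)

print(f"model accuracy: {model_accuracy:.4f}")
print(f"dummy accuracy: {dummy_accuracy:.4f}")
print(f"dummy recall:   {dummy_recall:.4f}")
```

The dummy model scores roughly 0.91 on accuracy while catching no fraud at all, which is why accuracy on its own says little on a dataset like this one.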

How to compute:

from sklearn.metrics import confusion_matrix, accuracy_score

y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = accuracy_score(y_true, y_pred_class) # or optionally (tp + tn) / (tp + fp + fn + tn) 

# log score to neptune
run["logs/accuracy"] = accuracy

How models score in this metric (threshold=0.5):

We can see that for all the models we beat the dummy model (all clean transactions) by a large margin. Also, the models that we'd expect to be better are in fact at the top.

How does it depend on the threshold:

accuracy by threshold

With accuracy, you can really use charts like the one above to determine the optimal threshold. In this case, choosing something a bit over the standard 0.5 could bump the score by a tiny bit (0.9686 -> 0.9688).

When to use it:

  • When your problem is balanced using accuracy is usually a good start. An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project,
  • When every class is equally important to you.

10. F beta score

Simply put, it combines precision and recall into one metric. The higher the score the better our model is. You can calculate it in the following way:

f beta

When choosing beta in your F-beta score, the more you care about recall over precision, the higher beta you should choose. For example, with F1 score we care equally about recall and precision, while with F2 score recall is twice as important to us.

F beta by beta

With 0<beta<1 we care more about precision and so the higher the threshold the higher the F beta score. When beta>1 our optimal threshold moves toward lower thresholds and with beta=1 it is somewhere in the middle.
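Since the formula is easier to follow with numbers, here is a small sketch of the standard F-beta definition, (1 + beta^2) * precision * recall / (beta^2 * precision + recall), applied to the counts from the confusion matrix in section 1. The function name f_beta is just an illustrative helper, not a library call.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (standard F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Counts from the confusion matrix shown in section 1.
tn, fp, fn, tp = 11918, 82, 333, 872
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision
f1 = f_beta(precision, recall, beta=1.0)      # balances both
f2 = f_beta(precision, recall, beta=2.0)      # leans toward recall

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"F0.5={f_half:.4f} F1={f1:.4f} F2={f2:.4f}")
```

Because precision is higher than recall for this model, the score drops as beta grows: the more weight recall gets, the more the weaker of the two numbers pulls the result down.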

How to compute:

from sklearn.metrics import fbeta_score

y_pred_class = y_pred_pos > threshold
fbeta = fbeta_score(y_true, y_pred_class, beta=beta)  # beta must be passed as a keyword in recent scikit-learn

# log score to neptune
run["logs/fbeta_score"] = fbeta

11. F1 score (beta=1)

It's the harmonic mean of precision and recall.

How to compute:

from sklearn.metrics import f1_score
y_pred_class = y_pred_pos > threshold
f1 = f1_score(y_true, y_pred_class)

# log score to neptune
run["logs/f1_score"] = f1

How models score in this metric (threshold=0.5):

As we can see combining precision and recall gave us a more realistic view of our models. We get 0.808 for the best one and a lot of room for improvement.

What is good is that it seems to be ranking our models correctly with those larger lightGBMs at the top.

How does it depend on the threshold:

f1 score by threshold

We can adjust the threshold to optimize F1 score. Notice that for both precision and recall you could get perfect scores by increasing or decreasing the threshold. The good thing is, you can find a sweet spot for the F1 metric. As you can see, getting the threshold just right can actually improve your score by a bit (0.8077 -> 0.8121).

When to use it:

  • Pretty much in every binary classification problem. It is my go-to metric when working on those problems. It can be easily explained to business stakeholders.

12. F2 score (beta=2)

It's a metric that combines precision and recall, putting 2x emphasis on recall.

How to compute:

from sklearn.metrics import fbeta_score

y_pred_class = y_pred_pos > threshold
f2 = fbeta_score(y_true, y_pred_class, beta=2)

# log score to neptune
run["logs/f2_score"] = f2

How models score in this metric (threshold=0.5):

This score is even lower for all the models than F1 but can be increased considerably by adjusting the threshold. Again, it seems to be ranking our models correctly, at least in this simple example.

How does it depend on the threshold:

f2 score by threshold

We can see that with a lower threshold, and therefore more true positives recalled, we get a higher score. You can usually find a sweet spot for the threshold. The possible gain from 0.755 -> 0.803 shows how important threshold adjustments can be here.

When to use it:

  • I'd consider using it when recalling positive observations (fraudulent transactions) is more important than being precise about it.

13. Cohen Kappa Metric

In simple words, Cohen Kappa tells you how much better your model is than a random classifier that predicts based on class frequencies.

cohen kappa

To calculate it one needs to calculate two things: "observed agreement" (po) and "expected agreement" (pe). Observed agreement (po) is simply how our classifier predictions agree with the ground truth, which means it is just accuracy. The expected agreement (pe) is how the predictions of the random classifier that samples according to class frequencies agree with the ground truth, or accuracy of the random classifier.

From an interpretation standpoint, I like that it extends something very easy to explain (accuracy) to situations where your dataset is imbalanced by incorporating a baseline (dummy) classifier.
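Here is a worked sketch of that calculation on the counts from the confusion matrix in section 1. It follows the standard definition of Cohen's kappa, kappa = (po - pe) / (1 - pe), where the expected agreement uses the marginal frequencies of both the true labels and the predictions; the variable names are my own.

```python
# Counts from the confusion matrix shown in section 1.
tn, fp, fn, tp = 11918, 82, 333, 872
total = tn + fp + fn + tp

# Observed agreement (po) is plain accuracy.
observed_agreement = (tp + tn) / total

# Expected agreement (pe): chance that labels and predictions match
# if both were drawn independently with their observed frequencies.
actual_pos = (tp + fn) / total
predicted_pos = (tp + fp) / total
expected_agreement = (
    actual_pos * predicted_pos + (1 - actual_pos) * (1 - predicted_pos)
)

cohen_kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)

print(f"po={observed_agreement:.4f} pe={expected_agreement:.4f}")
print(f"kappa={cohen_kappa:.4f}")
```

An accuracy of about 0.97 shrinks to a kappa of about 0.79 because roughly 0.85 agreement is already expected by chance on data this imbalanced.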

How to compute:

from sklearn.metrics import cohen_kappa_score

y_pred_class = y_pred_pos > threshold
cohen_kappa = cohen_kappa_score(y_true, y_pred_class)

# log score to neptune
run["logs/cohen_kappa_score"] = cohen_kappa

How models score in this metric (threshold=0.5):

We can easily distinguish the worst/best models based on this metric. Also, we can see that there is still a lot of room to improve our best model.

How does it depend on the threshold:

cohen kappa by threshold

With a chart like the one above we can find a threshold that optimizes Cohen Kappa. In this case, it is at 0.31, giving us some improvement (0.7909 -> 0.7947) over the standard 0.5.

When to use it:

  • This metric is not used heavily in the context of classification. Yet it can work really well for imbalanced problems and seems like a great companion/alternative to accuracy.

14. Matthews Correlation Coefficient MCC

It's a correlation between predicted classes and ground truth. It can be calculated based on values from the confusion matrix:

matthews correlation coefficient

Alternatively, you could also calculate the correlation between y_true and y_pred.
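Both routes can be checked against each other in a few lines. The sketch below uses the standard MCC formula, (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)), and a hand-written Pearson correlation; the helper names and the toy labels are made up for illustration.

```python
import math


def mcc_from_counts(tn, fp, fn, tp):
    """Matthews correlation coefficient from confusion matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy labels and class predictions: tp=2, fn=1, fp=1, tn=4.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]

toy_mcc = mcc_from_counts(tn=4, fp=1, fn=1, tp=2)
toy_corr = pearson(toy_true, toy_pred)

# The same formula applied to the counts from section 1.
article_mcc = mcc_from_counts(tn=11918, fp=82, fn=333, tp=872)

print(f"toy MCC={toy_mcc:.4f}, toy Pearson={toy_corr:.4f}")
print(f"MCC on the section 1 counts={article_mcc:.4f}")
```

On binary vectors the two numbers coincide, which is why MCC can be read like any other correlation: 1 is perfect agreement, 0 is no better than chance, and negative values mean systematic disagreement.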

How to compute:

from sklearn.metrics import matthews_corrcoef

y_pred_class = y_pred_pos > threshold
matthews_corr = matthews_corrcoef(y_true, y_pred_class)
run["logs/matthews_corrcoef"] = matthews_corr

How models score in this metric (threshold=0.5):

We can clearly see improvements in our model quality and a lot of room to grow, which I really like. Also, it ranks our models reasonably and puts models that you'd expect to be better on top. Of course, MCC depends on the threshold that we choose.

How does it depend on the threshold:

matthews correlation coefficient by threshold

We can adjust the threshold to optimize MCC. In our case, the best score is at 0.53 but what I really like is that it is not super sensitive to threshold changes.

When to use it:

  • When working on imbalanced problems,
  • When you want to have something easily interpretable.

15. ROC Curve

It is a chart that visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot it on one chart.

Of course, the higher the TPR and the lower the FPR at each threshold the better, so classifiers whose curves sit closer to the top-left corner are better.

Extensive discussion of ROC Curve and ROC AUC score can be found in this article by Tom Fawcett.

How to compute:

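import matplotlib.pyplot as plt
from scikitplot.metrics import plot_roc

# y_pred holds the probabilities for both classes, shape (n_samples, 2)
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)

# log figure to neptune
run["images/ROC"].upload(neptune.types.File.as_image(fig))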

How does it look:

roc curve

We can see a healthy ROC curve, pushed towards the top-left side both for positive and negative class. It is not clear which one performs better across the board as with FPR < ~0.15 positive class is higher and starting from FPR~0.15 the negative class is above.

16. ROC AUC score

In order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve, or ROC AUC score. The more top-left your curve is the higher the area and hence higher ROC AUC score.

Alternatively, it can be shown that ROC AUC score is equivalent to calculating the rank correlation between predictions and targets. From an interpretation standpoint, this view is more useful because it tells us how good your model is at ranking predictions: the score is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
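The probabilistic reading can be turned directly into code. The sketch below counts, over all positive/negative pairs, how often the positive gets the higher score (ties count as half). It is a quadratic-time illustration on made-up toy data, not how a library would compute it; the function name roc_auc_pairwise is hypothetical.

```python
def roc_auc_pairwise(y_true, y_score):
    """Share of (positive, negative) pairs where the positive is scored higher."""
    pos_scores = [s for y, s in zip(y_true, y_score) if y == 1]
    neg_scores = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Toy data, made up for illustration.
toy_true = [0, 0, 1, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

auc = roc_auc_pairwise(toy_true, toy_score)

# Squaring the scores changes their values but not their order.
auc_rescaled = roc_auc_pairwise(toy_true, [s ** 2 for s in toy_score])

print(f"AUC={auc:.4f}, AUC after rescaling={auc_rescaled:.4f}")
```

The rescaled scores give exactly the same result, which illustrates why ROC AUC tells you about ranking quality but nothing about whether the probabilities are well calibrated.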

How to compute:

from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_true, y_pred_pos)

# log score to neptune
run["logs/roc_auc_score"] = roc_auc

How models score in this metric:

We can see improvements and the models that one would guess to be better are indeed scoring higher. Also, the score is independent of the threshold which comes in handy.

When to use it:

  • You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
  • You should not use it when your data is heavily imbalanced. It was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
  • You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.

17. Precision-Recall Curve

It is a curve that combines precision (PPV) and Recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot it. The higher on y-axis your curve is the better your model performance.

You can use this plot to make an educated decision when it comes to the classic precision/recall dilemma. Obviously, the higher the recall the lower the precision. Knowing at which recall your precision starts to fall fast can help you choose the threshold and deliver a better model.

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_precision_recall

fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)

# log figure to neptune
run["images/precision_recall"].upload(neptune.types.File.as_image(fig))

How does it look:

precision recall curve

We can see that for the negative class we maintain high precision and high recall almost throughout the entire range of thresholds. For the positive class precision is starting to fall as soon as we are recalling 0.2 of true positives and by the time we hit 0.8, it decreases to around 0.7.

18. PR AUC score | Average precision

Similarly to ROC AUC score you can calculate the Area Under the Precision-Recall Curve to get one number that describes model performance.

You can also think about PR AUC as the average of precision scores calculated for each recall threshold [0.0, 1.0]. You can also adjust this definition to suit your business needs by choosing/clipping recall thresholds if needed.
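A minimal sketch of that averaging idea: walk down the predictions from the highest score to the lowest and, every time a positive is found (that is, every time recall steps up), record the precision at that point. The mean of those precision values is the average precision. The helper name and toy data are made up, and the sketch assumes there are no tied scores.

```python
def average_precision(y_true, y_score):
    """Mean of the precision values measured at each newly recalled positive.

    Assumes all scores are distinct (no special handling of ties).
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    found = 0
    total_precision = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            found += 1
            total_precision += found / rank  # precision at this recall step
    return total_precision / n_pos


# Toy data, made up for illustration.
toy_true = [0, 0, 1, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

avg_precision = average_precision(toy_true, toy_score)
print(f"average precision={avg_precision:.4f}")
```

In the toy example the three positives are found at ranks 1, 2 and 4, giving precisions of 1.0, 1.0 and 0.75, so the average is about 0.917.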

How to compute:

from sklearn.metrics import average_precision_score

avg_precision = average_precision_score(y_true, y_pred_pos)

# log score to neptune
run["logs/average_precision_score"] = avg_precision

How models score in this metric:

The models that we suspect to be "truly" better are in fact better in this metric, which is definitely a good thing. Overall, we can see high scores, but way less optimistic than ROC AUC scores (0.96+).

When to use it:

  • when you want to communicate precision/recall decision to other stakeholders
  • when you want to choose the threshold that fits the business problem.
  • when your data is heavily imbalanced. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.
  • when you care more about positive than negative class. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision).

19. Log loss

Log loss is often used as the objective function that is optimized under the hood of machine learning models. Yet, it can also be used as a performance metric.

Basically, we calculate the difference between ground truth and predicted score for every observation and average those errors over all observations. For one observation the error formula reads:

log loss

The more certain our model is that an observation is positive when it is, in fact, positive the lower the error. But this is not a linear relationship. It is good to take a look at how the error changes as that difference increases:

log loss

So our model gets punished very heavily when we are certain about something that is untrue. For example, when we give a score of 0.9999 to an observation that is negative our loss jumps through the roof. That is why sometimes it makes sense to clip your predictions to decrease the risk of that happening.
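To put numbers on that, here is a small sketch of the per-observation binary log loss, -log(p) for a positive and -log(1 - p) for a negative, with optional clipping. The function name log_loss_single and the clipping value 0.01 are illustrative assumptions, not recommendations.

```python
import math


def log_loss_single(y_true, p_pos, eps=0.0):
    """Binary log loss for one observation.

    p_pos is the predicted probability of the positive class; when eps > 0
    it is first clipped to the range [eps, 1 - eps].
    """
    if eps > 0:
        p_pos = min(max(p_pos, eps), 1 - eps)
    return -math.log(p_pos) if y_true == 1 else -math.log(1 - p_pos)


confident_right = log_loss_single(1, 0.9999)
confident_wrong = log_loss_single(0, 0.9999)
clipped_wrong = log_loss_single(0, 0.9999, eps=0.01)

print(f"confident and right: {confident_right:.4f}")
print(f"confident and wrong: {confident_wrong:.4f}")
print(f"confident and wrong, clipped: {clipped_wrong:.4f}")
```

A single confidently wrong prediction costs about 9.2, roughly ninety thousand times the loss of the confidently right one; clipping at 0.01 caps that penalty at about 4.6.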

If you want to learn more about log-loss read this article by Daniel Godoy.

How to compute:

from sklearn.metrics import log_loss

loss = log_loss(y_true, y_pred)

# log score to neptune
run["logs/log_loss"] = loss

How models score in this metric:

It is difficult to really see strong improvement and get an intuitive feeling for how strong the model is. Also, the model that was chosen as the best one before (BIN-101) is in the middle of the pack. That can suggest that using log-loss as a performance metric can be a risky proposition.

When to use it:

  • Pretty much always there is a performance metric that better matches your business problem. Because of that, I would use log-loss as an objective for your model with some other metric to evaluate performance.

20. Brier score

It is a measure of how far your predictions lie from the true values. For one observation it simply reads:

brier score

Basically, it is a mean square error in the probability space and because of that, it is usually used to calibrate probabilities of the machine learning models. If you want to read more about probability calibration I recommend that you read this article by Jason Brownlee.
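As a minimal sketch, the Brier score is just the mean of the squared differences between the predicted probability of the positive class and the 0/1 label. The toy data and helper name below are made up for illustration.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data, made up for illustration.
toy_true = [0, 1, 1, 0]
toy_prob = [0.1, 0.9, 0.6, 0.3]

toy_brier = brier_score(toy_true, toy_prob)

# Taking the square root puts the error back on the probability scale.
toy_root = math.sqrt(toy_brier)

print(f"Brier score={toy_brier:.4f}, square root={toy_root:.4f}")
```

Lower is better: a model that always outputs the right label with full confidence scores 0, while always predicting 0.5 scores 0.25 regardless of the data.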

It can be a great supplement to your ROC AUC score and other metrics that focus on other things.

How to compute:

from sklearn.metrics import brier_score_loss

brier_loss = brier_score_loss(y_true, y_pred_pos)

# log score to neptune
run["logs/brier_score_loss"] = brier_loss

How models score in this metric:

The model from experiment BIN-101 has the best calibration; for that model, our predictions were off by 0.16 on average (√0.0263309).

When to use it:

  • When you care about calibrated probabilities.

21. Cumulative gains chart

In simple words, it helps you gauge how much you gain by using your model over a random model for a given fraction of top scored predictions.

Simply put:

  • you order your predictions from highest to lowest and
  • for every percentile you calculate the fraction of true positive observations up to that percentile.

It makes it easy to see the benefits of using your model to target given groups of users/accounts/transactions especially if you really care about sorting them.
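The two steps above can be sketched in a few lines of plain Python. The helper cumulative_gain and the toy data are hypothetical; the function returns the share of all positives that sit in the top fraction of the ranked predictions.

```python
def cumulative_gain(y_true, y_score, fraction):
    """Share of all positives found in the top `fraction` of scored observations."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top_n = round(fraction * len(ranked))
    found = sum(label for _, label in ranked[:top_n])
    return found / sum(y_true)


# Toy data, made up for illustration: 3 positives among 10 observations.
toy_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
toy_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

gain_top_20 = cumulative_gain(toy_true, toy_score, 0.2)
gain_top_50 = cumulative_gain(toy_true, toy_score, 0.5)
gain_all = cumulative_gain(toy_true, toy_score, 1.0)

print(f"top 20%: {gain_top_20:.2f}, top 50%: {gain_top_50:.2f}, all: {gain_all:.2f}")
```

A random ordering would, on average, find 20% of the positives in the top 20% of observations, so any value above the fraction itself is the gain from using the model.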

How to compute:

import matplotlib.pyplot as plt
from scikitplot.metrics import plot_cumulative_gain

fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)

# log figure to neptune
run["images/cumulative_gains"].upload(neptune.types.File.as_image(fig))

How does it look:

cumulative gains chart

We can see that our cumulative gains chart shoots up very quickly as we increase the sample of highest-scored predictions. By the time we get to the 20th percentile over 90% of positive cases are covered. You could use this chart to prioritize and filter out possible fraudulent transactions for processing. 

Say we were to use our model to assign possible fraudulent transactions for processing and we needed to prioritize. We could use this chart to tell us where it makes the most sense to choose a cutoff.

When to use it:

  • Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
  • It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

22. Lift curve | lift chart

It is pretty much just a different representation of the cumulative gains chart:

  • we order the predictions from highest to lowest
  • for every percentile, we calculate the fraction of true positive observations up to that percentile for our model and for the random model,
  • we calculate the ratio of those fractions and plot it.

It tells you how much better your model is than a random model for the given percentile of top scored predictions.
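As a rough sketch of those steps: a random model is expected to capture a fraction p of the positives in the top p of the sample, so the lift is simply the cumulative gain divided by the fraction of the sample. The helper `lift_curve` and the toy data below are hypothetical, and scikit-plot may compute the curve differently in its details.

```python
import numpy as np

def lift_curve(y_true, y_score):
    """Return (fraction of sample, lift over a random ordering)."""
    y_true = np.asarray(y_true)
    # Order observations from the highest to the lowest score.
    order = np.argsort(-np.asarray(y_score), kind="stable")
    fractions = np.arange(1, len(y_true) + 1) / len(y_true)
    # Fraction of all positives captured by the model up to each depth.
    gains = np.cumsum(y_true[order]) / y_true.sum()
    # A random ordering captures `fractions` of the positives on average.
    return fractions, gains / fractions

# Toy example: 3 positives among 10 observations (made-up scores).
y_true_toy = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_score_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
fractions, lift = lift_curve(y_true_toy, y_score_toy)

print(lift[0], lift[-1])
```

The lift always ends at 1.0: once you take the whole sample, the model and the random ordering capture the same positives.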

How to compute:

from scikitplot.metrics import plot_lift_curve

fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)

# log figure to neptune
run["images/lift_curve"].upload(neptune.types.File.as_image(fig))

How does it look:

lift chart

So for the top 10% of predictions, our model is over 10x better than random, for the top 20% it is over 4x better, and so on.

When to use it:

  • Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
  • It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

23. Kolmogorov-Smirnov plot

The KS plot helps you assess the separation between the prediction distributions for the positive and negative classes.

In order to create it you:

  • sort your observations by the prediction score,
  • for every cutoff point (depth) between 0.0 and 1.0 of the sorted dataset, calculate the proportion of all positive and of all negative observations that fall within that depth,
  • plot those fractions, positive(depth)/positive(all), negative(depth)/negative(all), on Y-axis and dataset depth on X-axis.

So it works similarly to the cumulative gains chart, but instead of looking only at the positive class, it looks at the separation between the positive and negative classes.
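Here is a minimal sketch of the construction described above, following the depth-based description in this post. The function `ks_curves` and the toy data are hypothetical. The scikit-plot implementation may differ, for example in what it puts on the X-axis.

```python
import numpy as np

def ks_curves(y_true, y_score):
    """Return (depth, cumulative positive fraction, cumulative negative fraction)."""
    y_true = np.asarray(y_true)
    # Sort observations from the highest to the lowest score.
    order = np.argsort(-np.asarray(y_score), kind="stable")
    sorted_true = y_true[order]
    depth = np.arange(1, len(y_true) + 1) / len(y_true)
    # positive(depth) / positive(all)
    pos_frac = np.cumsum(sorted_true) / sorted_true.sum()
    # negative(depth) / negative(all)
    neg_frac = np.cumsum(1 - sorted_true) / (1 - sorted_true).sum()
    return depth, pos_frac, neg_frac

# Toy example: 3 positives among 10 observations (made-up scores).
y_true_toy = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_score_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
depth, pos_frac, neg_frac = ks_curves(y_true_toy, y_score_toy)

# The separation is the vertical gap between the two curves.
gap = pos_frac - neg_frac
best = int(np.argmax(gap))
print(depth[best], gap[best])
```

Plotting `pos_frac` and `neg_frac` against `depth` gives the two curves, and the widest vertical gap between them is what the KS statistic in the next section reports.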

A good explanation of the KS plot and the KS statistic can be found in this article by Riaz Khan.

How to compute:

from scikitplot.metrics import plot_ks_statistic

fig, ax = plt.subplots()
plot_ks_statistic(y_true, y_pred, ax=ax)

# log figure to neptune
run["images/kolmogorov-smirnov"].upload(neptune.types.File.as_image(fig))

How does it look:

ks plot

We can see that the largest separation is at a cutoff point of 0.034 of top predictions. Past that point, the separation shrinks at a moderate rate as we increase the percentage of top predictions, and around 0.8 it starts dropping quickly. So even though the best separation is at 0.034, we could potentially push the cutoff a bit higher to get more positively classified observations.

24. Kolmogorov-Smirnov statistic

To turn the KS plot into a single number that we can use as a metric, we look at all thresholds (dataset cutoffs) from the KS plot and take the largest distance (separation) between the distributions of true positive and true negative observations.

If there is a threshold for which all observations above are truly positive and all observations below are truly negative we get a perfect KS statistic of 1.0.
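To check that intuition, here is a small NumPy sketch that computes the statistic as the largest gap between the empirical score distributions of the two classes. The function `ks_statistic` and the toy data are my own illustration, assuming the standard two-sample KS definition. It is not the `binary_ks_curve` implementation.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest gap between the score CDFs of the positive and negative classes."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = np.sort(y_score[y_true == 1])
    neg = np.sort(y_score[y_true == 0])
    # Evaluate both empirical CDFs at every observed score.
    grid = np.unique(y_score)
    cdf_pos = np.searchsorted(pos, grid, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, grid, side="right") / len(neg)
    return float(np.max(np.abs(cdf_pos - cdf_neg)))

# Perfectly separated classes: every positive scores above every negative.
ks_perfect = ks_statistic([0, 0, 0, 1, 1], [0.1, 0.2, 0.3, 0.8, 0.9])

# Overlapping classes (made-up scores): the statistic drops below 1.
ks_mixed = ks_statistic(
    [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
)

print(ks_perfect, ks_mixed)
```

The perfectly separated example reaches 1.0, while the overlapping one lands well below it.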

How to compute:

from scikitplot.helpers import binary_ks_curve

res = binary_ks_curve(y_true, y_pred_pos)
ks_stat = res[3]

# log score to neptune
run["logs/ks_statistic"] = ks_stat

How models score in this metric:

Using the KS statistic as the metric, we were able to rank BIN-101 as the best model, which is the one we expect to be "truly" best.

When to use it:

  • When your problem is about sorting/prioritizing the most relevant observations and you care equally about positive and negative classes.
  • It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

Final thoughts

In this blog post, you've learned about various classification metrics and performance charts.

We went over metric definitions and interpretations, learned how to calculate them, and talked about when to use them.

Hopefully, with all that knowledge you will be fully equipped to deal with metric-related problems in your future projects.

Bonus:

To help you use the information from this blog post to the fullest, I have prepared a logging function and a binary classification metrics cheatsheet.

Check those out below!

Logging function

You can log all of those metrics and performance charts that we covered for your machine learning project and explore them in Neptune using our Python client.

  • install the client:
pip install neptune
  • import and run:
import neptune

run = neptune.init_run(...)

# log score to neptune
run["logs/score"] = score
  • explore everything in the app:

Visit Neptune docs to see what you can log and display in the app. 

Binary classification metrics cheatsheet

We've created a cheatsheet that condenses all the content from this blog post into a digestible, few-page document, which you can print and use whenever you need a reference on binary classification metrics.

Get your binary classification metrics cheatsheet
