Interpreting AUC-ROC

Categories: data science, R
Author: Matt Kaye
Published: March 9, 2023

AUC goes by many names: AUC, AUC-ROC, ROC-AUC, the area under the curve, and so on. It’s an extremely important metric for evaluating machine learning models and it’s an uber-popular data science interview question. It’s also, at least in my experience, the single most commonly misunderstood metric in data science.

I’ve heard several common misunderstandings or flat-out falsehoods from people in all kinds of roles discussing AUC. The biggest offenses tend to come from overcomplicating the topic. It’s easy to look at the Wikipedia page for the ROC curve and come away confused, intimidated, or some combination of the two. ROC builds on other fundamental data science concepts – the true and false positive rates of a classifier – so it’s certainly not a good place to start learning about metrics for evaluating model performance.

The most common source of confusion about AUC seems to be the plot of the ROC curve itself, rather than anything particularly special about AUC. Generally, I’ll hear AUC explained as the area under the ROC curve, and as a test of how well your model balances false positives and false negatives. That’s all well and good, but it doesn’t give someone new to AUC any intuition about what the metric actually means in practice. For instance, let’s imagine we’re trying to predict the chance that a student is accepted at Carleton College – a quite common problem at CollegeVine! How does saying “AUC tells me how my model is balancing false negatives and false positives” tell me anything about how well my model is doing at predicting that student’s chances?

The main issue I have with this factual-yet-unhelpful explanation of AUC is just that: while it may be true, it doesn’t get to the point. Even worse, it’s sometimes used as a crutch: a fallback answer for someone who feels stuck when asked how to interpret AUC in real, practical terms.

In this post, then, I’ll focus on just one thing: answering the question above about how to interpret AUC.

What is AUC?

As I mentioned, it’s usually not helpful to explain AUC to someone by telling them that it’s just the area under the ROC curve, or that it’s a metric for models that predict probabilities rather than classes, or that it’s a metric that balances false positives and false negatives. None of those explanations gets to the crux of the problem.

So what is AUC, then? It’s pretty simple: imagine a model \(M\) being evaluated on data \(X\), where \(X\) contains some instances of the true class and some instances of the false class. The AUC of \(M\) on \(X\) is the probability that, given a random item \(T\) from the true class and a random item \(F\) from the false class, the model assigns a higher predicted probability of being true to \(T\) than to \(F\).
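Written as a formula (a light formalization of the paragraph above, counting a tie as half a win to match the coin-flip tie-breaking in the implementation below; the notation \(\hat{p}\) for the model’s predicted probability is introduced here for convenience):

\[
\text{AUC}(M, X) = \Pr\left[\hat{p}(T) > \hat{p}(F)\right] + \frac{1}{2}\Pr\left[\hat{p}(T) = \hat{p}(F)\right]
\]

where \(T\) is drawn uniformly at random from the true-class items of \(X\) and \(F\) from the false-class items.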

Let’s go back to the example about Carleton admissions, and let’s imagine that we have a model that gives a probability of admission to Carleton given some information about a student. If I give the model one random accepted student and one random rejected student, the AUC of the model is the probability that the accepted student had a higher chance of acceptance (as estimated by the model) than the rejected student did.

For more on this, I’d refer everyone to this fantastic blog post by the team at Google, which does a great job at explaining further and/or better.

A Simple Implementation

The easiest way to convey this idea might be to show a simple implementation of AUC. Below is some R code.

First, let’s start by writing a function to do exactly what’s described above. Again, here’s the algorithm given some evaluation data:

  1. Choose a random item from the true class.
  2. Choose a random item from the false class.
  3. Make a prediction on each of the two items.
  4. If the predicted probability for the actually true item is greater than the predicted probability for the actually false item, return true. Otherwise, return false. If they’re equal, flip a coin.
  5. Repeat 1-4 many times, and calculate the proportion of the time your model guessed correctly. This is your AUC.

Now, let’s write this in R with a little help from some vectorization.

library(rlang)
library(dplyr)
library(tibble)

## Our AUC implementation
## In this implementation, we take a data frame containing a "truth" (i.e. whether
## the example is _actually_ in either the true class or the false class)
## and an "estimate" (our predicted probability).
## This implementation is in line with how {{yardstick}} implements all of its metrics
interpretable_auc <- function(data, N, truth_col = "truth", estimate_col = "estimate") {
  
  ## First, subset the data down to just trues and just falses, separately
  trues <- filter(data, .data[[truth_col]] == 1)
  falses <- filter(data, .data[[truth_col]] == 0)
  
  ## Sample the predicted probabilities for N `true` examples, with replacement
  random_trues <- sample(trues[[estimate_col]], size = N, replace = TRUE)
  
  ## Do the same for N `false` examples
  random_falses <- sample(falses[[estimate_col]], size = N, replace = TRUE)

  ## If the predicted probability for the actually true
  ##  item is greater than that of the actually false item,
  ##  return `true`. 
  ## If the two are equal, flip a coin.
  ## Otherwise, return false.
  true_wins <- ifelse(
    random_trues == random_falses,
    runif(N) > 0.50,
    random_trues > random_falses
  )
  
  ## Compute the percentage of the time our model was "right"
  mean(true_wins)
}

Next, we can test our simple implementation against yardstick on some real data. For the sake of demonstration, I just used the built-in mtcars data. Here’s how the data looks:

library(knitr)
library(kableExtra)

## Doing a little data wrangling
data <- mtcars %>%
  transmute(
    vs = as.factor(vs),
    mpg,
    cyl
  ) %>%
  as_tibble()

data %>%
  slice_sample(n = 6) %>%
  kable("html", caption = 'Six rows of our training data') %>%
  kable_styling(position = "center", full_width = TRUE)
Six rows of our training data

vs     mpg   cyl
1     17.8     6
1     24.4     4
0     15.5     8
1     22.8     4
1     30.4     4
1     21.4     6

Now, let’s fit a few logistic regression models to the data to see how our AUC implementation compares to the yardstick one.

library(purrr)
library(yardstick)

## Simplest model -- Just an intercept. AUC should be 50%
model1 <- glm(vs ~ 1, data = data, family = binomial)

## Adding another predictor
model2 <- glm(vs ~ mpg, data = data, family = binomial)

## And another
model3 <- glm(vs ~ mpg + cyl, data = data, family = binomial)

## Make predictions for all three models
preds <- tibble(
  truth = data$vs,
  m1 = predict(model1, type = "response"),
  m2 = predict(model2, type = "response"),
  m3 = predict(model3, type = "response")
)

## For each model, compute AUC with both methods: Yardstick (library) and "homemade"
map_dfr(
  c("m1", "m2", "m3"),
  ~ {
    yardstick <- roc_auc(preds, truth = truth, estimate = !!.x, event_level = "second")$.estimate
    homemade <- interpretable_auc(preds, N = 100000, truth_col = "truth", estimate_col = .x)
    tibble(
      model = .x,
      yardstick = round(yardstick, digits = 2),
      homemade = round(homemade, digits = 2)
    )
  }
) %>%
  kable("html", caption = 'Yardstick vs. Our Implementation') %>%
  kable_styling(position = "center", full_width = TRUE)
Yardstick vs. Our Implementation

model   yardstick   homemade
m1           0.50       0.50
m2           0.91       0.91
m3           0.95       0.95

As we’ve seen here, AUC actually shouldn’t be all that much of a cause for confusion! The way I like to frame it is this: the AUC of your model is how good your model is at making even-odds bets. If I give your model two options and ask it to pick which one it thinks is more likely, a “better” model (by AUC standards) will pick the true one more often.

In real terms, that’s a meaningful, good thing. If we’re trying to predict the probability of a patient having cancer, it’s important that our model can distinguish between people with cancer and people without it when given one person from each group. If it couldn’t - meaning the model was either guessing randomly or doing worse than random - the AUC would be 50% (or below 50%, in the worse-than-random disaster scenario).

Additional Thoughts

I also often hear the misconception that AUC is sensitive to things like class imbalance: the claim is that if the true class makes up a disproportionately large (or small) share of the evaluation data, that imbalance can skew the AUC. But based on the intuition we just built, that’s not true. The key thing to remember is that the model is always handed one true and one false example. In choosing those, it doesn’t matter if the true class makes up only 0.005% of all of the examples in the evaluation data: AUC only evaluates the model on its ability to determine which of the two is the true one.
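As a quick sanity check of that claim, we can artificially inflate the false class and confirm that the AUC barely moves. This is a sketch reusing the preds tibble and interpretable_auc() from above; the tenfold duplication of the false class is an arbitrary choice.

## Bind ten copies of the false-class rows to the original true-class rows.
## The empirical distribution of false-class predictions is unchanged, so
## the AUC should agree with the original up to Monte Carlo noise.
falses_x10 <- bind_rows(replicate(10, filter(preds, truth == 0), simplify = FALSE))
imbalanced <- bind_rows(filter(preds, truth == 1), falses_x10)

interpretable_auc(preds, N = 100000, truth_col = "truth", estimate_col = "m3")
interpretable_auc(imbalanced, N = 100000, truth_col = "truth", estimate_col = "m3")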

However, there is one thing related to class imbalance, and to sample size in general, that does affect AUC: the raw number of examples of each class in the evaluation data. If, for instance, you had only a single instance of the true class in the evaluation set, then the AUC of the model is entirely determined by how good the model’s prediction is on that single example. If the model predicts a 100% probability for that one true example then, assuming none of the predictions for the other examples is 100%, the AUC of the model as evaluated on that data is 100%. That isn’t because the model is “good” in any meaningful sense, but because the AUC estimate is over-indexing on a single good prediction in the evaluation set. In practice, that estimate wouldn’t generalize: as we got more data, the predictions for the true-class examples certainly wouldn’t all be 100%, so the measured AUC would come down over time.
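To make that concrete, here’s a tiny made-up evaluation set (the predicted probabilities are hypothetical, not outputs of the models fit earlier) in which the lone true example merely outranks every false one:

## A toy evaluation set with a single true example. Because its prediction
## outranks every false example's, the AUC comes out to 1 -- even though a
## 0.60 prediction is hardly evidence of a great model.
one_true <- tibble(
  truth = factor(c(1, 0, 0, 0, 0), levels = c(0, 1)),
  estimate = c(0.60, 0.55, 0.30, 0.20, 0.10)
)

interpretable_auc(one_true, N = 100000)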

Fortunately, there’s an easy fix for this problem. An AUC is just a point estimate, so we can also estimate a margin of error or a confidence interval around it. In a situation where we only have a single instance of the true class in the evaluation set, that interval would be very wide.
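One simple way to get that interval is bootstrapping. Here’s a minimal sketch reusing preds and interpretable_auc() from above; the 1,000 resamples and the 95% level are arbitrary choices, and packages like rsample offer more polished tooling for this.

## Resample the evaluation data with replacement, recompute the AUC on each
## resample, and take quantiles of the resulting distribution.
set.seed(2023)

boot_aucs <- map_dbl(
  seq_len(1000),
  ~ interpretable_auc(
    slice_sample(preds, prop = 1, replace = TRUE),
    N = 10000,
    truth_col = "truth",
    estimate_col = "m3"
  )
)

## An approximate 95% confidence interval for model 3's AUC
quantile(boot_aucs, probs = c(0.025, 0.975))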

Wrapping Up

Hopefully this post helped give a better intuition for what AUC actually is! A couple of major takeaways:

  1. AUC doesn’t need to be a super complicated thing about trading off false positives and false negatives across many different classification thresholds. In my opinion, it’s much simpler to think of it as the probability that your model guesses correctly when asked to pick the true item out of a true/false pair.
  2. AUC isn’t affected by class imbalance, although the raw number of examples of each class does affect how precise your AUC estimate is.