
AUROC and AUPRC: What They Actually Tell You About Your Ranking Model

Your model's AUROC went from 0.78 to 0.80. Your A/B test shows statistical significance, and there's pressure to launch. But what does a 0.02 improvement actually mean? Will users notice? Is it worth the engineering effort to productionize?

I've been in this situation more times than I can count. The honest answer: it depends. And understanding what these metrics measure—from multiple angles—is the only way to make that call.

By the end of this article, you'll understand:

  • Five different interpretations of AUROC (and when each is useful)
  • Why AUPRC often matters more than AUROC for recsys tasks
  • What "good" values look like for CTR, conversion, and other common tasks
  • How to compute these metrics correctly in Python
  • When a 0.01 improvement matters and when it doesn't

#Every Ranking Model Is a Binary Classifier

First, why do these metrics apply to recommendation systems at all?

Every ranking model you build—CTR prediction, conversion likelihood, watch time estimation—is fundamentally a binary classification problem. Even if you ultimately use the scores for ranking (showing the highest-scored items first), the model itself is trained on binary labels: clicked or not clicked, converted or not converted, watched-to-completion or abandoned.

This means all the machinery from classification evaluation applies directly. AUROC and AUPRC aren't just for medical diagnosis or fraud detection—they're measuring exactly what your ranking model does: separate the good items from the bad ones in score space.

The key insight: your model's scores ARE your ranking signal. If an item gets score 0.85 and another gets 0.42, you're betting the first one is more relevant. AUROC and AUPRC measure how often that bet is correct.

#AUROC: Five Ways to Understand It

AUROC (Area Under the Receiver Operating Characteristic curve) is one of the most commonly reported metrics in machine learning, and also one of the most misunderstood. Let me give you five different ways to think about it—each useful in different contexts.

#1. The ROC Curve: Tracing the Trade-off

The ROC curve plots True Positive Rate (TPR) against False Positive Rate (FPR) as you sweep the classification threshold from 1.0 down to 0.0.

What does "sweeping the threshold" mean? Your model outputs a score between 0 and 1 for each item. To make a binary prediction (click / no click), you pick a threshold: any item with score above the threshold is predicted positive. By trying every possible threshold, you get every possible (FPR, TPR) pair — that's the curve.

Let's make this concrete with a CTR prediction example. Say your model outputs click probabilities for 1000 impressions: 50 of them actually got clicked (positives), 950 didn't (negatives).

  • True Positive Rate (TPR) = (correctly predicted clicks) / (total actual clicks)

    • At threshold 0.9: maybe you catch 5 of 50 clicks → TPR = 0.10
    • At threshold 0.5: maybe you catch 35 of 50 clicks → TPR = 0.70
    • At threshold 0.1: you catch 48 of 50 clicks → TPR = 0.96
  • False Positive Rate (FPR) = (incorrectly predicted clicks) / (total actual non-clicks)

    • At threshold 0.9: maybe 2 non-clicks have scores above 0.9 → FPR = 2/950 = 0.002
    • At threshold 0.5: maybe 100 non-clicks have scores above 0.5 → FPR = 100/950 = 0.105
    • At threshold 0.1: maybe 400 non-clicks have scores above 0.1 → FPR = 400/950 = 0.421

As you lower the threshold, you catch more positives (TPR goes up) but also make more false alarms (FPR goes up). The ROC curve traces out this trade-off—every point on the curve is a different operating point your model could achieve.
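To make the sweep concrete, here is a minimal sketch (assuming y_true is a 0/1 NumPy array and scores holds your model's predicted click probabilities) that computes the TPR/FPR pair at a few thresholds:

Python
import numpy as np

def tpr_fpr_at_threshold(y_true, scores, threshold):
    """TPR and FPR if everything scored at or above the threshold is predicted positive."""
    pred_pos = scores >= threshold
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)  # fraction of actual clicks we caught
    fpr = fp / np.sum(y_true == 0)  # fraction of non-clicks we flagged
    return tpr, fpr

# Each threshold gives one (FPR, TPR) point; sweeping all thresholds traces the ROC curve
for t in [0.9, 0.5, 0.1]:
    tpr, fpr = tpr_fpr_at_threshold(y_true, scores, t)
    print(f"threshold={t:.1f}: TPR={tpr:.3f}, FPR={fpr:.3f}")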

AUROC is the area under this curve. A perfect model (all positives scored higher than all negatives) achieves AUROC = 1.0. A random model achieves AUROC = 0.5. Anything below 0.5 means your model is worse than random—flip your predictions.

[Interactive ROC & PR Curve Explorer: score distributions for a dataset where the positive class is 20% of 500 examples, with a draggable classification threshold. At threshold 0.50 this example model reaches AUROC = 0.853 and AUPRC = 0.639, with confusion matrix TP = 78, FP = 100, FN = 22, TN = 300, giving TPR (recall) = 0.780, FPR = 0.250, and precision = 0.438. Drag the threshold slider to see how the operating point (yellow dot) moves along both curves, and notice how a random model has AUROC ≈ 0.50 and AUPRC ≈ the positive rate (0.200).]

Play with the threshold slider above. Notice how:

  • Moving the threshold changes your position on the ROC curve
  • Different model qualities produce different curve shapes
  • A better model has a curve that hugs the top-left corner

#2. Concordance: The Pairwise Ranking Interpretation

Here's the interpretation I find most useful for recsys work:

AUROC = P(score_positive > score_negative) for a randomly chosen positive-negative pair.

Pick a random clicked item and a random non-clicked item. What's the probability your model scores the clicked item higher? That probability is AUROC.

This is sometimes called the concordance probability or the Mann-Whitney U statistic (normalized). The math:

$$\text{AUROC} = \frac{\sum_{i \in \text{pos}} \sum_{j \in \text{neg}} \left[\mathbf{1}[\text{score}_i > \text{score}_j] + \tfrac{1}{2}\,\mathbf{1}[\text{score}_i = \text{score}_j]\right]}{|\text{pos}| \times |\text{neg}|}$$

Where $\mathbf{1}[\cdot]$ is the indicator function (1 if the condition holds, 0 otherwise). Tied scores contribute 0.5 to the count — equivalent to breaking ties randomly.

Why this matters for recsys: Your ranking model's entire job is to rank good items above bad items. AUROC directly measures this. If AUROC = 0.80, then 80% of the time, when you compare a clicked item to a non-clicked item, the clicked item has a higher score.

[Interactive demo: AUROC as Concordance Probability. Pick a random positive-negative pair; AUROC is the probability the positive gets the higher score. Example scores: positives (clicked) Item A = 0.92, Item B = 0.71, Item C = 0.45; negatives (not clicked) Item D = 0.83, Item E = 0.62, Item F = 0.38, Item G = 0.21. That gives 3 × 4 = 12 pairs, of which 9 are concordant and 3 discordant, so AUROC = 9/12 = 0.750, the fraction of pairs where the model ranked the positive item higher.]

The visualization above shows this directly. Each line connects a positive (clicked) to a negative (non-clicked) example. Green lines are concordant pairs (positive scored higher); red lines are discordant pairs (positive scored lower). AUROC is just the fraction of green lines.
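If you want to verify the arithmetic, here is a quick sketch that recomputes the example above by brute force and checks it against scikit-learn:

Python
import numpy as np
from sklearn.metrics import roc_auc_score

# The seven items from the example: A, B, C were clicked; D, E, F, G were not
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.92, 0.71, 0.45, 0.83, 0.62, 0.38, 0.21])

# Brute-force concordance: 3 positives x 4 negatives = 12 pairs
concordant = sum(p > n for p in scores[y_true == 1] for n in scores[y_true == 0])
print(concordant / 12)                 # 0.75
print(roc_auc_score(y_true, scores))   # 0.75 -- same number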

#3. Average TPR: The Integral View

There's another way to think about the area under the ROC curve:

$$\text{AUROC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})$$

In words: AUROC is the True Positive Rate averaged over all possible False Positive Rates. Since we're integrating TPR as FPR sweeps from 0 to 1, this integral is exactly the mathematical definition of the area under the ROC curve.

Another way to read it: for every possible FPR value from 0 to 1, what's the corresponding TPR? Average all those TPR values together.

This interpretation is useful when thinking about fairness or robustness. A model with AUROC = 0.85 doesn't necessarily have TPR = 0.85 at every FPR. It might have TPR = 0.95 at high FPR values but TPR = 0.60 at low FPR values. The curve shape matters, not just the area.
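As a sanity check of the integral view, you can recover AUROC by numerically integrating the ROC curve yourself. A minimal sketch, assuming y_true and scores are arrays from your validation set:

Python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(y_true, scores)

# Trapezoidal integration of TPR over FPR -- literally the area under the curve
# (use np.trapezoid on NumPy >= 2.0)
area = np.trapz(tpr, fpr)
print(area, roc_auc_score(y_true, scores))  # the two values match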

#4. Score Separation: The Distribution View

Another way to visualize AUROC: plot the score distributions for positives and negatives separately.

  • AUROC = 0.5: The two distributions overlap completely. Positives and negatives have the same score distribution—the model learned nothing.
  • AUROC = 0.7: Partial separation. The positive distribution is shifted right of the negative distribution, but there's substantial overlap.
  • AUROC = 0.9: Strong separation. Most positives have higher scores than most negatives, with only a small overlap region.
  • AUROC = 1.0: Perfect separation. No overlap—every positive scores higher than every negative.

This view helps diagnose model issues. If both distributions are concentrated near 0.5 with huge overlap, your features aren't discriminative. If the positive distribution is bimodal, you might have different subpopulations in your positive class that need different treatment.
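The plot itself takes only a few lines. A sketch, again assuming y_true and scores are NumPy arrays from your validation set:

Python
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 4))
plt.hist(scores[y_true == 0], bins=50, alpha=0.5, density=True, label="Negatives")
plt.hist(scores[y_true == 1], bins=50, alpha=0.5, density=True, label="Positives")
plt.xlabel("Model score")
plt.ylabel("Density")
plt.title("Score distributions by class")
plt.legend()
plt.show()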

#5. The Wilcoxon Rank-Sum Connection

For the statistically inclined: AUROC is equivalent to the Wilcoxon rank-sum test statistic (also called Mann-Whitney U), normalized by the product of sample sizes.

The Wilcoxon test asks: "Do these two samples come from distributions with different locations?" It does this by ranking all observations together and comparing the sum of ranks for each group.

If you've ever needed to statistically test whether your model's scores meaningfully separate the classes (not just report a point estimate), you can use the Wilcoxon test. The p-value tells you the probability of seeing this much separation under the null hypothesis that positives and negatives have identical score distributions.

In practice, I rarely use this for model evaluation—with recsys-scale data, everything is statistically significant. But it's useful to know the connection exists.
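If you ever do want to run the test, here is a sketch. It assumes y_true and scores are NumPy arrays; with SciPy 1.7 or newer, mannwhitneyu returns the U statistic for the first sample, which normalizes to AUROC:

Python
from scipy.stats import mannwhitneyu

pos_scores = scores[y_true == 1]
neg_scores = scores[y_true == 0]

# One-sided test: are positive scores stochastically greater than negative scores?
u_stat, p_value = mannwhitneyu(pos_scores, neg_scores, alternative="greater")
auroc = u_stat / (len(pos_scores) * len(neg_scores))
print(f"AUROC from U: {auroc:.4f}, p-value: {p_value:.3g}")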

#AUPRC: When AUROC Isn't Enough

So when does AUROC fall short? When your classes are severely imbalanced — which is exactly the situation in most recsys tasks. A 2% CTR means 98% of your examples are negatives. AUROC can paint a rosy picture while your model is barely finding the needles in the haystack.

AUPRC (Area Under the Precision-Recall Curve) measures the same fundamental thing as AUROC — how well your model separates classes — but it does so in a way that's much more sensitive to performance on the minority class.

#The PR Curve

The Precision-Recall curve plots Precision against Recall as you sweep the threshold:

  • Precision = (true positives) / (predicted positives) = "Of the items I predicted would be clicked, how many actually were?"
  • Recall = (true positives) / (actual positives) = "Of the items that were actually clicked, how many did I catch?"

In recsys terms:

  • Precision at threshold 0.8: "If I only recommend items with score > 0.8, what fraction of those recommendations are actually clicked?"
  • Recall at threshold 0.8: "What fraction of the total clicks am I capturing by recommending items with score > 0.8?"

As you lower the threshold:

  • Recall goes up (you recommend more items, catching more clicks)
  • Precision typically goes down (you're less selective, so more recommendations are misses)

AUPRC is the area under this curve. Higher is better, but the baseline isn't 0.5—it's the positive class rate.
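Here is a minimal sketch of precision and recall at a single threshold (assuming y_true and scores are NumPy arrays):

Python
import numpy as np

def precision_recall_at_threshold(y_true, scores, threshold):
    """Precision and recall if we recommend every item scored at or above the threshold."""
    recommended = scores >= threshold
    tp = np.sum(recommended & (y_true == 1))
    precision = tp / max(recommended.sum(), 1)  # of what we recommended, how much was clicked
    recall = tp / np.sum(y_true == 1)           # of all clicks, how many we captured
    return precision, recall

for t in [0.8, 0.5, 0.2]:
    p, r = precision_recall_at_threshold(y_true, scores, t)
    print(f"threshold={t:.1f}: precision={p:.3f}, recall={r:.3f}")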

#Why AUPRC Matters More for RecSys

Here's the key insight: AUROC doesn't care about class imbalance, but AUPRC does.

Consider a CTR prediction task with 2% positive rate. You have 100 clicks and 4,900 non-clicks.

A model that scores every single example at 0.5 (completely useless) achieves:

  • AUROC = 0.5 (random)
  • AUPRC ≈ 0.02 (the positive rate)

A model that correctly ranks 80% of positive-negative pairs achieves:

  • AUROC = 0.80 (looks good!)
  • AUPRC might be only 0.15 (hmm...)

The AUROC looks respectable, but AUPRC reveals the truth: your model still struggles to surface the rare positive examples. At most thresholds, the majority of your predicted positives are actually negatives.

[Interactive demo: Class Imbalance: Why AUPRC Matters More Than AUROC. The same model quality produces very different AUPRC values depending on class balance, while AUROC remains stable; adjust the positive class rate to see the effect. At a 5.0% positive rate (100 positives / 1,900 negatives), the example model scores AUROC = 0.839 against a random baseline of 0.500, but AUPRC = 0.308 against a random baseline of 0.050.]

AUROC is stable across imbalance

Because AUROC measures concordance between positive and negative scores, it does not depend on the ratio between the classes. A model with the same score separation always gets roughly the same AUROC.

AUPRC drops with more imbalance

The random baseline for AUPRC equals the positive class rate. At 1% positive rate, random AUPRC = 0.01. A model with AUPRC = 0.15 at 1% imbalance is actually quite good — that is 15x above random.

The demo above makes this concrete. Adjust the positive rate and watch what happens:

  • AUROC stays stable (it's invariant to class balance)
  • AUPRC drops dramatically as positives become rarer
  • The random baseline for AUPRC equals the positive rate

For recsys tasks with typical 1-5% positive rates, AUPRC is often the more honest metric.

#Average Precision

In practice, you'll often see "Average Precision" (AP) reported instead of AUPRC. They're closely related but not identical, and the distinction trips people up.

Average Precision sums precision values at each threshold where a new positive is recalled:

$$\text{AP} = \sum_k (R_k - R_{k-1}) \times P_k$$

Where $k$ indexes the thresholds in decreasing score order (each threshold is the score at which a new positive is recalled), and $P_k$, $R_k$ are precision and recall at that threshold. This is a step-function (Riemann sum) approximation to the area under the PR curve.

AUPRC uses trapezoidal interpolation between adjacent points on the PR curve. In regions where precision drops steeply, trapezoidal interpolation can overestimate or underestimate the true area compared to step interpolation.

In practice, the values are usually very close. sklearn.metrics.average_precision_score computes AP (step-function), not the trapezoidal AUPRC. Most recsys teams use AP and call it "AUPRC" interchangeably — which is fine for comparing models, but worth knowing if you're ever comparing numbers across different implementations.
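If you want to see the gap on your own data, here is a quick comparison sketch (assuming y_true and scores come from your validation set):

Python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

precision, recall, _ = precision_recall_curve(y_true, scores)

ap = average_precision_score(y_true, scores)  # step-function AP
auprc_trapz = auc(recall, precision)          # trapezoidal area under the same PR curve

print(f"AP (step): {ap:.4f}, AUPRC (trapezoidal): {auprc_trapz:.4f}")  # close, not identical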

#A Note of Nuance

The claim that "AUPRC is better for imbalanced data" is common in applied ML, and there's real substance to it — but it's worth noting the picture is more nuanced than "always prefer AUPRC."

AUROC measures global ranking quality: across all possible thresholds, how well does your model separate classes? AUPRC emphasizes performance at the top of the ranking — the region where precision matters most. For recsys, where you only show the top-K items, that emphasis on the top of the ranking is often what you care about.

That said, AUROC remains the right metric when you care about the full ranking or when your operating point isn't fixed at the top. Both metrics answer different questions. Report both.

#Baseline Values: What Does "Good" Mean?

#Random Model Baselines

Let's establish what random performance looks like:

Random AUROC = 0.5, always, regardless of class balance.

Proof: A random model assigns scores independently of the true labels. So for any positive-negative pair, P(score_pos > score_neg) = 0.5 by symmetry.

Random AUPRC = positive class rate.

Proof sketch: A random model's precision at any recall level equals the positive rate. The area under a horizontal line at height $p$ (from recall 0 to 1) is just $p$.

This is why raw AUPRC values are hard to interpret without knowing the positive rate. AUPRC = 0.10 is excellent if your positive rate is 1%, but terrible if it's 10%.
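Both baselines are easy to check empirically. A small sketch with an assumed 2% positive rate:

Python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.02, 100_000)            # 2% positives
random_scores = rng.uniform(0, 1, 100_000)    # scores independent of the labels

print(f"AUROC: {roc_auc_score(y, random_scores):.3f}")            # ~0.5
print(f"AUPRC: {average_precision_score(y, random_scores):.3f}")  # ~0.02, the positive rate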

#The "Lift Over Random" Framing

For AUPRC, I find it helpful to report the lift over random (sometimes called "gain" or "uplift factor" — this isn't standardized terminology, but the concept is widely used in industry):

$$\text{Lift} = \frac{\text{AUPRC}}{\text{positive rate}}$$

A lift of 1.0 means random performance. A lift of 5x means your model is 5 times better than random at surfacing positives.

For a CTR prediction task with 2% CTR:

  • AUPRC = 0.02 → Lift = 1x (useless)
  • AUPRC = 0.10 → Lift = 5x (decent)
  • AUPRC = 0.20 → Lift = 10x (good)
  • AUPRC = 0.40 → Lift = 20x (excellent)

#Industry Benchmarks

Here are typical ranges I've seen across different recsys tasks:

Metric Benchmarks for RecSys Tasks

Typical ranges based on published papers and industry practice. Your mileage will vary — these depend heavily on feature quality, data volume, and task difficulty.

| Task | Pos. Rate | Random (AUC / AUPRC) | Typical (AUC / AUPRC) | Good (AUC / AUPRC) | Great (AUC / AUPRC) |
|---|---|---|---|---|---|
| CTR Prediction (display ads, feed ranking) | 1–5% | 0.500 / 0.01–0.05 | 0.70–0.78 / 0.08–0.15 | 0.78–0.85 / 0.15–0.30 | 0.85+ / 0.30+ |
| Conversion Prediction (purchase, subscribe) | 0.1–1% | 0.500 / 0.001–0.01 | 0.72–0.80 / 0.02–0.08 | 0.80–0.88 / 0.08–0.20 | 0.88+ / 0.20+ |
| Video Watch Prediction (will user watch >50%?) | 10–30% | 0.500 / 0.10–0.30 | 0.68–0.75 / 0.30–0.45 | 0.75–0.82 / 0.45–0.60 | 0.82+ / 0.60+ |
| Fraud / Abuse (fake reviews, spam) | 0.01–0.5% | 0.500 / <0.005 | 0.85–0.92 / 0.05–0.20 | 0.92–0.97 / 0.20–0.50 | 0.97+ / 0.50+ |

Key insight: Always compare AUPRC relative to the random baseline (= positive rate). An AUPRC of 0.10 on a task with 1% positive rate is 10x above random — that is a strong model. The same AUPRC of 0.10 on a 50% balanced task would be terrible (below random).

These are rough guidelines. Your specific task could be easier or harder depending on:

  • Feature quality and quantity
  • How well-defined the positive class is
  • User behavior consistency
  • Cold start challenges

#Rules of Thumb for AUROC

After working on ranking models for years, here's my mental model:

  • AUROC < 0.55: Your model barely beats random. Check for bugs, feature leakage, or insufficient signal.
  • AUROC 0.55-0.65: Weak signal. The model learned something, but not much. Good enough for a first prototype, not for production.
  • AUROC 0.65-0.75: Typical for a decent first model on a challenging task. Room for improvement, but usable.
  • AUROC 0.75-0.85: Good, well-engineered model. This is where most production recsys models land after proper feature engineering.
  • AUROC > 0.85: Either you have an excellent model, an easy task, or data leakage. Investigate before celebrating.

Warning signs:

  • AUROC > 0.95 almost always indicates leakage (you're accidentally using the label as a feature)
  • A/B test results that don't match offline AUROC improvements suggest overfitting or offline/online distribution mismatch

#Computing AUROC and AUPRC in Python

Let's get practical. Here's complete, runnable code for computing these metrics.

Python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics import roc_curve, precision_recall_curve

# Generate synthetic recsys-like data
np.random.seed(42)
n_samples = 10000
positive_rate = 0.03  # 3% CTR, typical for many recsys tasks

# True labels
y_true = np.random.binomial(1, positive_rate, n_samples)
n_pos = y_true.sum()
n_neg = n_samples - n_pos
print(f"Positives: {n_pos}, Negatives: {n_neg}, Rate: {n_pos/n_samples:.2%}")

# Simulate three models with different quality levels
def make_scores(y_true, signal_strength):
    """Generate scores with controllable discrimination."""
    noise = np.random.normal(0, 1, len(y_true))
    signal = y_true * signal_strength
    scores = 1 / (1 + np.exp(-(signal + noise)))  # Sigmoid to [0,1]
    return scores

scores_random = np.random.uniform(0, 1, n_samples)
scores_weak = make_scores(y_true, signal_strength=1.0)
scores_decent = make_scores(y_true, signal_strength=2.0)
scores_good = make_scores(y_true, signal_strength=3.0)

# Compute metrics
for name, scores in [("Random", scores_random),
                     ("Weak", scores_weak),
                     ("Decent", scores_decent),
                     ("Good", scores_good)]:
    auroc = roc_auc_score(y_true, scores)
    auprc = average_precision_score(y_true, scores)
    lift = auprc / positive_rate
    print(f"{name:8s}: AUROC={auroc:.3f}, AUPRC={auprc:.3f}, Lift={lift:.1f}x")

Output from one run:

Positives: 285, Negatives: 9715, Rate: 2.85%
Random  : AUROC=0.502, AUPRC=0.029, Lift=1.0x
Weak    : AUROC=0.691, AUPRC=0.078, Lift=2.7x
Decent  : AUROC=0.813, AUPRC=0.195, Lift=6.8x
Good    : AUROC=0.895, AUPRC=0.374, Lift=13.1x

Notice how AUPRC shows a much wider range than AUROC, making it easier to distinguish model quality.

#Computing AUROC From Scratch

Understanding the concordance calculation helps build intuition:

Python
def auroc_from_concordance(y_true, scores):
    """Compute AUROC by counting concordant pairs."""
    pos_scores = scores[y_true == 1]
    neg_scores = scores[y_true == 0]

    concordant = 0
    total_pairs = len(pos_scores) * len(neg_scores)

    for p_score in pos_scores:
        # Count how many negatives this positive beats
        concordant += (p_score > neg_scores).sum()
        # Handle ties: count as 0.5
        concordant += 0.5 * (p_score == neg_scores).sum()

    return concordant / total_pairs

# Verify it matches sklearn
auroc_manual = auroc_from_concordance(y_true, scores_decent)
auroc_sklearn = roc_auc_score(y_true, scores_decent)
print(f"Manual: {auroc_manual:.6f}")
print(f"Sklearn: {auroc_sklearn:.6f}")

This O(n_pos × n_neg) algorithm is slow for large datasets, but it makes the concordance interpretation crystal clear. Production implementations use sorting-based algorithms that run in O(n log n).
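For reference, here is one way to write the O(n log n) version, using the rank-sum identity from the Wilcoxon section. This sketch reuses y_true and scores_decent from the earlier script:

Python
import numpy as np
from scipy.stats import rankdata

def auroc_from_ranks(y_true, scores):
    """O(n log n) AUROC via the rank-sum formula (average ranks handle ties)."""
    ranks = rankdata(scores)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Mann-Whitney U for the positive class, normalized by the number of pairs
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

print(f"Rank-based: {auroc_from_ranks(y_true, scores_decent):.6f}")  # matches the values above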

#Plotting the Curves

Always look at the actual curves, not just the areas:

Python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ROC curves
ax = axes[0]
for name, scores in [("Random", scores_random),
                     ("Decent", scores_decent),
                     ("Good", scores_good)]:
    fpr, tpr, _ = roc_curve(y_true, scores)
    auroc = roc_auc_score(y_true, scores)
    ax.plot(fpr, tpr, label=f"{name} (AUC={auroc:.3f})")
ax.plot([0, 1], [0, 1], 'k--', label="Random baseline")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curves")
ax.legend()

# PR curves
ax = axes[1]
for name, scores in [("Random", scores_random),
                     ("Decent", scores_decent),
                     ("Good", scores_good)]:
    precision, recall, _ = precision_recall_curve(y_true, scores)
    auprc = average_precision_score(y_true, scores)
    ax.plot(recall, precision, label=f"{name} (AP={auprc:.3f})")
ax.axhline(y=positive_rate, color='k', linestyle='--', label="Random baseline")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title("Precision-Recall Curves")
ax.legend()

plt.tight_layout()
plt.savefig("roc_pr_curves.png", dpi=150)

Look at the curve shapes, not just the areas. Two models with identical AUROC can have very different curves—one might dominate at low FPR (useful if false positives are expensive), another at high FPR.

#Practical Advice for RecSys Engineers

#When to Prefer AUROC vs AUPRC

Use AUROC when:

  • Comparing models on the same dataset (AUROC is scale-invariant)
  • Communicating with stakeholders who expect it
  • Class balance is moderate (10%+ positive rate)
  • You care equally about all operating points

Use AUPRC when:

  • Class imbalance is severe (< 5% positive rate)
  • False negatives are more costly than false positives
  • You want to surface rare positive events (fraud, conversion, viral content)
  • Comparing across datasets with different positive rates (use lift)

Best practice: Report both. They tell you different things. AUROC = 0.82, AUPRC = 0.15 tells a very different story than AUROC = 0.82, AUPRC = 0.45.

#Interpreting Small Improvements

"We improved AUROC from 0.782 to 0.791. Should we launch?"

This depends on several factors:

Statistical significance: With large enough samples, tiny differences are significant. Significance doesn't imply importance.

Effect on ranking: A 0.009 AUROC improvement means roughly 0.9 percentage points more concordant pairs. In a catalog of 1M items where you show 10 recommendations, this might mean the clicked item moves up by 0.1 positions on average. Is that noticeable?

Online vs offline: Offline metric improvements often don't fully translate to online gains. A 0.01 AUROC improvement offline might yield a 0.1% CTR lift online, or it might yield nothing due to distribution shift.

Cost-benefit: What did this improvement cost? A 0.005 AUROC gain from 6 months of research isn't worth it. The same gain from fixing a feature bug in one day absolutely is.

My rough heuristic: 0.01 AUROC improvement is worth investigating, 0.02+ is worth launching (assuming the improvement is real and not overfit). But always validate with online experiments.

#Look at the Full Curve

Summarizing a curve as a single number loses information. Two models with AUROC = 0.80 might have very different characteristics:

  • Model A: High TPR at low FPR, flattens out. Good at finding the "obvious" positives.
  • Model B: Low TPR at low FPR, climbs steadily. Better at ranking the "borderline" cases.

If your system only shows the top 10 recommendations (low FPR), Model A is better. If you're re-ranking a pre-filtered candidate set (moderate FPR), Model B might win.

Always plot the curves before making launch decisions.

#Calibration vs Discrimination

AUROC and AUPRC measure discrimination—can the model separate classes in rank order? They say nothing about calibration—does a predicted probability of 0.7 mean 70% of such predictions are positive?

A model can have perfect AUROC (1.0) but be badly calibrated. If all positives get score 0.51 and all negatives get score 0.49, ranking is perfect but probabilities are meaningless.

For ranking tasks, discrimination is usually what matters. But if you're using predicted probabilities directly (e.g., in an auction or to set expectations), calibration matters too. Use calibration plots and Brier score alongside AUROC/AUPRC.
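A minimal calibration check with scikit-learn, assuming y_true and scores are validation-set arrays:

Python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Bin predictions and compare the observed positive rate to the mean predicted probability
frac_positive, mean_predicted = calibration_curve(y_true, scores, n_bins=10)

plt.plot(mean_predicted, frac_positive, marker="o", label="Model")
plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed positive rate")
plt.legend()
plt.show()

print(f"Brier score: {brier_score_loss(y_true, scores):.4f}")  # lower is better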

#AUROC vs Log Loss

Log loss (binary cross-entropy) is the most common training objective, but it measures something different from AUROC:

  • Log loss penalizes confident wrong predictions heavily. A prediction of 0.99 for a negative example hurts a lot.
  • AUROC only cares about ranking. It doesn't matter if positives get scores of 0.51 vs 0.99, as long as they beat the negatives.

You can have:

  • Low log loss, low AUROC (well-calibrated but not discriminative)
  • High log loss, high AUROC (discriminative but overconfident)
  • Low log loss, high AUROC (the ideal)

In practice, optimizing log loss usually improves AUROC too. But if they diverge, it's worth investigating why.
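A toy sketch of the "discriminative but overconfident" case: perfect ranking with probabilities squeezed around 0.5, at an assumed 2% positive rate:

Python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y = np.array([1] * 20 + [0] * 980)              # 2% positive rate
squeezed = np.where(y == 1, 0.51, 0.49)         # every positive just beats every negative
base_rate = np.full_like(y, 0.02, dtype=float)  # always predict the base rate

print(roc_auc_score(y, squeezed))   # 1.0   -- perfect discrimination
print(log_loss(y, squeezed))        # ~0.67 -- terrible log loss
print(log_loss(y, base_rate))       # ~0.10 -- better log loss with zero discrimination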

#Segment Your Evaluation

Overall AUROC can hide problems in subpopulations:

  • New users: Cold start is hard. Your model might have AUROC = 0.85 overall but AUROC = 0.60 for users with < 10 interactions.
  • New items: Same issue. That trending video uploaded yesterday has no engagement signal.
  • Different user segments: Power users vs casual browsers might have very different patterns.

Always compute metrics per-segment. A model that's great overall but terrible for new users will hurt growth. A model that's bad for your highest-value segment will hurt revenue.
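A minimal per-segment sketch, assuming a hypothetical user_segments array aligned with y_true and scores:

Python
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.DataFrame({"y_true": y_true, "score": scores, "segment": user_segments})

for segment, group in df.groupby("segment"):
    if group["y_true"].nunique() < 2:
        continue  # AUROC is undefined when a segment contains only one class
    auroc = roc_auc_score(group["y_true"], group["score"])
    print(f"{segment}: AUROC={auroc:.3f} (n={len(group)}, pos_rate={group['y_true'].mean():.2%})")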

#Key Takeaways

Let me summarize the most important points:

  1. AUROC = P(score_positive > score_negative). This concordance interpretation is the most useful for recsys. Your model's job is to rank positives above negatives; AUROC measures exactly that.

  2. AUPRC is more informative for imbalanced data. With 2% CTR, random AUPRC is 0.02. A model with AUPRC = 0.20 has 10x lift over random — that tells you much more than AUROC = 0.80 alone.

  3. Report both metrics, plus lift. AUROC tells you about ranking ability, AUPRC tells you about ability to surface rare positives. Lift over random makes AUPRC comparable across different class balances.

  4. Look at the curves, not just the areas. Two models with the same AUROC can have very different operating characteristics. The curve shape tells you where your model is strong and weak.

  5. Small AUROC improvements (< 0.01) are rarely meaningful. Unless you're at massive scale, you probably won't see real-world impact. Spend your time on bigger wins.

  6. Segment your evaluation. Overall metrics hide problems in subpopulations. Check new users, new items, and your most important segments separately.

The single most important thing to remember: AUROC and AUPRC measure your model's ability to rank—to put positives above negatives in score order. They don't measure calibration, fairness, latency, or any of the other things that matter for production systems. They're necessary but not sufficient for a good ranking model.

Now go look at your model's PR curve. Is it as good as you thought?