Your model trained. AUC (Area Under the ROC Curve) improved by 2%. You deployed it. Live metrics went sideways.
If you've built recommender systems, you've probably experienced this frustrating disconnect between offline and online performance. The culprit is often bias in your training data. Not "bias" in the ethical sense (though that matters too), but systematic distortions between what you observe and what users actually prefer.
This article covers the six major biases in recommender systems, why they matter, and what you can do about them. After reading, you'll be able to recognize which biases affect your system, understand how they distort training and evaluation, and choose mitigation techniques suited to your constraints.
Here's the core issue: user behavior data is observational, not experimental.
When a user clicks on an item, you observe it. When they don't click, you observe that too. But you have no idea why they didn't click. Did they see the item and dislike it? Did they never see it at all? Were they in a hurry and only looked at the top result?
As Chen et al. put it in their 2020 survey: "Blindly fitting the data without considering the inherent biases will result in many serious issues."
Here's the core problem: if you observe that users who clicked on "Star Wars" also clicked on "Marvel movies," you might conclude these users like blockbuster action. But maybe your recommendation system just kept showing them blockbusters. You observed a correlation between "user clicked Star Wars" and "user clicked Marvel," but you can't distinguish between:

- Preference: these users genuinely like blockbuster action.
- Exposure: the system mostly showed them blockbusters, so that's all they could click.
Without running an experiment where you randomly show different users different options, you can't tell which explanation is true. This is why we need debiasing—to recover signal from confounded data.
The problems compound when you realize that your training data was generated by a previous recommendation system. You're training on data that reflects past model behavior, not just user preferences. This creates circular feedback that can amplify errors over time.
In experimental data, you control what users see and measure their responses. In observational data, users self-select what they interact with, and you only see the outcome of that selection process.
In this context, bias means a systematic deviation between what you observe and what's actually true about user preferences.
The causal perspective helps here. Schnabel et al. (2016) framed it elegantly: "Exposing a user to an item in a recommendation system is an intervention analogous to exposing a patient to a treatment in a medical study."
Just like medical researchers can't simply observe who takes medication (sicker people might self-select), you can't simply observe who clicks on items (exposure, position, and social influence all confound the signal).
I find it useful to categorize biases by where they enter the system:

- Data biases (input): selection, exposure, position, and conformity bias distort the interactions you collect.
- Result biases (output): popularity bias distorts what the trained model recommends.
- Feedback loops connect the two: today's biased output becomes tomorrow's biased input.
The dangerous part is that biases in the output become biases in the input for the next training cycle. This feedback loop can cause small initial errors to compound geometrically.
With this framework in mind, let's examine each bias in detail.
The problem: Users choose what to interact with. Unobserved interactions are not random.
This is probably the most fundamental bias. When you see a rating, it's because a user chose to rate that item. Users tend to rate items they like, items they feel strongly about, or items they've actually encountered. The missing ratings are not "Missing Completely At Random" (MCAR) - they're "Missing Not At Random" (MNAR).
Let $R_{u,i}$ indicate whether user $u$ would truly like item $i$ if they saw it (their true preference), and let $O_{u,i}$ indicate whether we observe an interaction. Mathematically, we can express selection bias as:

$$P(O_{u,i} = 1 \mid R_{u,i}) \neq P(O_{u,i} = 1)$$

In other words, the probability of observing an interaction depends on whether the user would have liked the item.
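A ten-line simulation makes the distortion concrete. The observation probabilities below are made-up numbers chosen only to mimic "people mostly rate what they liked":

```python
import numpy as np

rng = np.random.default_rng(0)

# True preferences: ratings 1-5, uniform across 100k user-item pairs
true_ratings = rng.integers(1, 6, size=100_000)

# MNAR observation: the chance a rating is submitted rises with the rating
observe_prob = {1: 0.02, 2: 0.04, 3: 0.08, 4: 0.20, 5: 0.40}
p_obs = np.vectorize(observe_prob.get)(true_ratings)
observed_mask = rng.random(len(true_ratings)) < p_obs

print(f"True mean rating:     {true_ratings.mean():.2f}")
print(f"Observed mean rating: {true_ratings[observed_mask].mean():.2f}")
```

The observed mean lands well above 4 even though the true mean is 3 — the dataset looks far more positive than the underlying preferences are.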
Why it matters: If you train a model treating unobserved interactions as negatives (which many implicit feedback models do), you're assuming users rejected items they may never have seen.
Common Mistake
Many practitioners treat all unobserved user-item pairs as negative examples. This conflates "didn't interact" with "wouldn't like if seen." The two are very different.
Real-world evidence: Research on explicit rating data consistently shows that rating distributions skew positive (Chen et al., 2020). On Netflix, 75% of ratings are 4 stars or higher. Not because 75% of movies are good—but because people rate movies they liked. The missing ratings (the movies users started and stopped after 5 minutes) tell a very different story that never makes it into your training data.
The problem: Items shown higher get more clicks regardless of quality.
This one is well-documented and substantial. Joachims et al. (2017) showed that "the order of presentation has a strong influence on where users click." Users trust the first few results and often don't examine lower positions at all.
The standard model decomposes clicks into two independent factors—examination and relevance:

$$P(\text{click} \mid u, i, k) = P(\text{examined} \mid k) \times P(\text{relevant} \mid u, i)$$

where $k$ denotes the position.
This is called the examination hypothesis. A user must first examine an item, then decide whether to click based on its relevance.
How bad is it? In my experience with search and recommendation systems, CTR (click-through rate) typically drops 30-50% between position 1 and position 2, and continues declining roughly logarithmically. Position 10 might get 10x fewer clicks than position 1, even for equally relevant items.
```python
# Rough model of position bias:
# examination probability decays with position
def examination_prob(position, decay=0.8):
    """P(examine | position) modeled here with exponential/geometric decay."""
    return decay ** (position - 1)

# Example: positions 1-5
for pos in range(1, 6):
    print(f"Position {pos}: {examination_prob(pos):.2%} examination rate")

# Position 1: 100.00% examination rate
# Position 2: 80.00% examination rate
# Position 3: 64.00% examination rate
# Position 4: 51.20% examination rate
# Position 5: 40.96% examination rate
```
Why it matters: If you train on click data without accounting for position, you learn "items in position 1 are better" rather than "these items are genuinely preferred."
The problem: Users can only interact with items they're exposed to.
This is closely related to selection bias, but the cause is different. Selection bias comes from user choice; exposure bias comes from system choice. Your previous recommendation model decided what to show. Items that weren't shown couldn't be clicked, no matter how relevant they were.
Two flavors:

- Zero exposure: items the system never showed generate no signal at all, positive or negative.
- Unequal exposure: among shown items, exposure frequency varies enormously, so feedback volume reflects the exposure policy as much as item quality.
The consequence is that observed interactions are always a biased subset of potential interactions. Items with high historical exposure have more opportunities to receive feedback, creating a data imbalance.
Impact on catalog coverage: At scale, exposure bias often means the majority of interactions come from a small fraction of the catalog—power-law distributions (where a small number of items receive the vast majority of interactions) are typical. The "long tail" of items gets almost no signal, making it impossible to learn whether they're good or not.
The problem: Popular items get recommended more, which makes them more popular.
This is a result bias (it affects recommendations, not just data collection), but it creates feedback loops that amplify over time. The dynamic works like this:
Abdollahpouri et al. (2024) define popularity bias with an impact-oriented framing: "A recommender system faces issues of popularity bias when the recommendations focus on popular items to the extent that they limit the value of the system or create harm."
The key insight is that recommending popular items isn't inherently bad. Popular items are often genuinely good. The problem is when popularity overwhelms other signals, reducing personalization and suppressing potentially relevant niche items.
The long-tail distribution: Most recommendation datasets follow a power-law distribution where a small "head" of items receives the vast majority of interactions. Training on this data produces models that perpetuate and amplify this imbalance.
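To see how extreme the head/tail split gets, here's a quick sketch assuming a hypothetical Zipf-like (1/rank) popularity curve over a 10,000-item catalog:

```python
import numpy as np

n_items = 10_000

# Zipf-like popularity: interaction share proportional to 1 / rank
ranks = np.arange(1, n_items + 1)
interactions = 1.0 / ranks
interactions /= interactions.sum()

# Share of all interactions captured by the top 1% of items
top_1pct = int(n_items * 0.01)
head_share = interactions[:top_1pct].sum()
print(f"Top 1% of items capture {head_share:.0%} of interactions")
```

Under this curve, the top 1% of items capture roughly half of all interactions — before any recommender amplifies the imbalance further.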
The problem: Users modify their behavior based on others' opinions.
When users see that an item has 4.8 stars, they rate it higher than they would have otherwise. When they see a video has 10 million views, they're more likely to click and more likely to enjoy it (or at least report enjoying it).
This means "feedback does not always signify user true preference" (Chen et al., 2020). You're observing socially-influenced behavior, not pure preference.
Manifestations:

- Rating anchoring: displayed average ratings pull individual ratings toward them.
- Social proof in clicks: view counts and popularity badges increase click-through independent of content quality.
- Herding: early feedback disproportionately shapes later feedback.
Why it's tricky: Conformity bias is partially legitimate signal. If everyone likes something, it probably is good. But it's also partially noise that reduces the diversity of feedback you receive.
The problem: Your model trains on its own biased outputs.
This is the meta-bias that amplifies all the others. Every time you retrain on user interactions that were shaped by previous recommendations, you compound existing biases.
Chaney et al. (2018) demonstrated this with careful simulations. They set up a community of 100 users interacting with items over 1,000 time intervals, testing different recommendation algorithms. Their key finding: "Recommendation systems that include repeated training homogenize user behavior more than is needed for ideal utility."
The simulation results show clear patterns:
The homogenization effect: Users exposed to similar recommendations start behaving more similarly. This reduces diversity in training data, which reduces diversity in recommendations, which further reduces behavioral diversity. It's a convergence spiral.
Who gets hurt? Users with minority preferences suffer disproportionately. As Chaney et al. observed: "Users who experience losses in utility generally have higher homogenization with their nearest neighbor." If your preferences don't match the majority, recommendations push you toward mainstream items, and your true preferences get less signal in the training data.
The algorithmic confounding problem: It becomes difficult to tell which user interactions stem from true preferences and which are influenced by recommendations. Your training data is contaminated by your own model's behavior, making it hard to learn anything genuinely new about user preferences.
The Compounding Problem
Feedback loops don't just maintain bias - they amplify it. Small initial biases can compound significantly over several retraining cycles. This is why periodic re-evaluation with fresh data is essential.
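A deliberately tiny simulation shows the lock-in mechanism. All the numbers below are made up: two items share the same true CTR, the only difference is one extra historical click, and a greedy policy always shows the current leader:

```python
import numpy as np

# Two items with identical true CTR (10%); item A starts one click ahead
clicks = np.array([11.0, 10.0])

for cycle in range(10):
    # Greedy "retraining": all 100 impressions go to the historical winner
    winner = int(np.argmax(clicks))
    clicks[winner] += 100 * 0.1  # same true click-through rate for both

share_a = clicks[0] / clicks.sum()
print(f"Final clicks: {clicks}, item A share: {share_a:.1%}")
```

Item A ends up with over 90% of cumulative clicks purely because of a one-click head start — the other item never gets the exposure needed to prove itself.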
Understanding the six biases is the first step. The next question is: where do they actually hurt?
Understanding where biases hurt helps prioritize which ones to address. The impact varies across the ML lifecycle.
When your model fits biased data, it learns spurious correlations instead of causal relationships.
Example: If items in position 1 always get clicked, the model might learn "position 1 items are good" rather than "these specific items are good." If popular items dominate the training set, the model learns "recommend popular items" rather than learning to personalize.
The fundamental issue is distribution mismatch. You're training on $P_{\text{logged}}(x, y)$ but you want to predict well under $P_{\text{true}}(x, y)$, where $x$ represents user and item features and $y$ represents the outcome (click, rating, purchase, etc.). These distributions differ systematically.
Concrete example: Consider a music recommendation system. Users who listen to Taylor Swift also tend to click on other pop artists, but that's partly because the system keeps showing them pop. If you train on this data naively, you'll learn "Taylor Swift fans like pop" when the truth might be "Taylor Swift fans would enjoy indie folk if they ever heard it."
The model becomes a self-fulfilling prophecy machine, not a preference discovery system.
This is where many teams get burned. Your offline evaluation uses a test set drawn from the same biased distribution as your training data. A model that learns biases well will score well on this test set - it's predicting the biased outcomes accurately!
But when you deploy, you're serving recommendations to the true distribution of user preferences. The mismatch shows up as: "We improved AUC by 2% but engagement dropped."
As Chen et al. note: "The discrepancy between offline evaluation and online metrics hurts user satisfaction and trust."
Why this happens:

- The test set inherits the same exposure and position biases as the training set, because both come from the same logging policy.
- Popular and highly-exposed items are overrepresented in both splits, so a model that predicts "popular" scores well offline.
- Genuinely novel recommendations have little or no test data to validate against.
A model that learns these biases fits the test distribution well. But that's not the distribution you care about serving.
Even if your model is better than the current production system, estimating how much better is hard. This is the counterfactual evaluation problem: you can't directly observe what would have happened under a different policy.
The "winner's curse" compounds this. If you test multiple models and select the one with the highest estimated improvement, you're likely overestimating its true improvement due to evaluation variance.
Policy mismatch in practice: Your training data was generated by policy A (the old model). You're evaluating policy B (the new model). But the outcomes you observe for policy B's recommendations were never collected under policy A. You're extrapolating into regions of the feature space with no data support.
This is why counterfactual evaluation methods (IPS, doubly robust estimators) exist - to estimate what would have happened under a different policy using data collected under the current policy.
Over time, feedback loops cause:

- Shrinking effective catalog: recommendations concentrate on a narrowing head of items.
- Homogenized users: behavioral diversity declines as everyone sees similar recommendations.
- Calcified preferences: the system stops learning anything new about what users might like.
These are slow-moving problems that won't show up in two-week A/B tests but erode system health over months and years. By the time you notice, the damage is done.
Now that we understand the problem, let's look at solutions.
The intuition: Reweight samples to "undo" the biased collection process.
If you know the probability that each observation would be collected (the propensity), you can upweight unlikely observations and downweight likely ones. This creates an unbiased estimate of the true distribution.
Think of propensity as answering: "How likely was it that we would have observed this particular interaction?" If an item only appears 10% of the time, observations of it are "surprising" and should count more heavily.
For a given user-item pair $(u, i)$ with observed outcome $\delta_{u,i}$, the propensity is the probability that the pair was observed at all:

$$p_{u,i} = P(O_{u,i} = 1)$$
The IPS estimator uses these propensities to correct for biased sampling. Let $\delta_{u,i}$ represent any evaluation metric (e.g., squared error for rating prediction, binary cross-entropy for clicks), and $N$ be the total number of possible user-item pairs:

$$\hat{R}_{\text{IPS}} = \frac{1}{N} \sum_{(u,i)\,:\,O_{u,i}=1} \frac{\delta_{u,i}}{p_{u,i}}$$
Schnabel et al. (2016) showed this estimator is unbiased when propensities are correctly specified.
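Here's a small self-contained check of that unbiasedness claim, on synthetic data where likeable pairs are logged far more often than disliked ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# True outcome for every user-item pair (what we'd like the average of)
true_outcome = rng.random(n)

# Biased logging: pairs the user would like are far more likely to be logged
propensity = 0.05 + 0.9 * true_outcome
observed = rng.random(n) < propensity

naive = true_outcome[observed].mean()
ips = (true_outcome[observed] / propensity[observed]).sum() / n

print(f"True mean:  {true_outcome.mean():.3f}")
print(f"Naive mean: {naive:.3f}")   # overshoots: observed pairs skew positive
print(f"IPS mean:   {ips:.3f}")     # close to the true mean again
```

The naive average over logged pairs overshoots badly; the IPS-weighted average recovers the true mean because each observation is divided by how likely it was to be logged.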
The variance problem: IPS has a major practical issue. When propensities are small (rare observations), the weights become huge, leading to high variance estimates.
Solutions:

- Clipping: cap the maximum importance weight.
- Self-normalization (SNIPS): divide by the sum of weights instead of $N$.
```python
import numpy as np

# Clipped IPS - cap the maximum weight
def clipped_ips_weight(propensity, max_weight=10.0):
    """Clip IPS weights to reduce variance."""
    return min(1.0 / propensity, max_weight)

# Self-Normalized IPS (SNIPS) - normalize by sum of weights
def snips_estimator(propensities, outcomes):
    """
    SNIPS normalizes by the sum of importance weights,
    providing lower variance at the cost of slight bias.
    """
    weights = 1.0 / np.asarray(propensities)
    weighted_outcomes = weights * np.asarray(outcomes)
    return weighted_outcomes.sum() / weights.sum()
```
SNIPS (Self-Normalized IPS) is often preferred in practice because it's more stable, though it introduces slight bias.
Practical Advice
You rarely know true propensities. Estimate them from your logging data - impression counts are often a reasonable proxy for exposure probability. Even rough estimates help significantly compared to ignoring bias entirely.
Doubly Robust Methods:
What if your propensity estimates are wrong? Doubly robust estimators combine IPS with an imputation model to be more robust to misspecification.
The idea is to use two components:

- A propensity model that estimates $p_{u,i}$, the probability each pair was observed.
- An imputation model that predicts the metric value $\hat{\delta}_{u,i}$ for every user-item pair, observed or not.
The estimator uses the imputation as a baseline and applies an IPS correction where we have actual observations:

$$\hat{R}_{\text{DR}} = \frac{1}{N} \sum_{u,i} \hat{\delta}_{u,i} \;+\; \frac{1}{N} \sum_{(u,i)\,:\,O_{u,i}=1} \frac{\delta_{u,i} - \hat{\delta}_{u,i}}{p_{u,i}}$$
The key property: this estimator is unbiased if either the propensity model or the imputation model is correct. You get two chances to be right.
In practice, doubly robust methods often outperform pure IPS, especially when propensities are estimated noisily. The imputation model provides a reasonable baseline, and IPS corrects where you have data.
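A quick synthetic check of the "two chances to be right" property. Here the imputation model is deliberately biased upward, and the IPS correction (with correct propensities) pulls the estimate back:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

true_outcome = rng.random(n)                # what we want the mean of
propensity = 0.05 + 0.9 * true_outcome      # biased logging, as before
observed = rng.random(n) < propensity

# A deliberately *biased* imputation model (systematically too optimistic)
imputed = 0.8 * true_outcome + 0.2

# Doubly robust: imputation baseline + IPS correction on observed pairs
dr = imputed.mean() + (
    (true_outcome[observed] - imputed[observed]) / propensity[observed]
).sum() / n

print(f"True mean:       {true_outcome.mean():.3f}")
print(f"Imputation only: {imputed.mean():.3f}")   # biased high
print(f"DR estimate:     {dr:.3f}")               # corrected
```

By symmetry, the estimator also recovers the truth when the imputation is exactly right but the propensities are wrong (the correction terms vanish) — that's the doubly robust guarantee.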
Propensity estimation for position bias:
The cleanest approach is controlled experiments. Joachims et al. (2017) proposed a minimal intervention: swap documents between positions and measure CTR changes.
To estimate the ratio of examination probabilities between positions $k$ and $k'$, run a randomization experiment that swaps items between those positions, then measure the click-through rates each position receives:

$$\frac{P(\text{examined} \mid k)}{P(\text{examined} \mid k')} \approx \frac{\text{CTR}_k}{\text{CTR}_{k'}}$$

Because the swap makes the item distribution identical at both positions, any remaining CTR difference is attributable to examination alone.
This reveals relative examination probabilities without destroying user experience.
Click models:
Alternatively, you can fit a click model to logged data. The Position-Based Model (PBM) assumes clicks factor into a position-only examination term and an item-only relevance term:

$$P(\text{click} \mid u, i, k) = P(\text{examined} \mid k) \times P(\text{relevant} \mid u, i)$$
More sophisticated models account for user behavior dynamics:

- Cascade model: users scan top-down and stop at the first click.
- User Browsing Model (UBM): examination depends on the position of the previous click.
- Dynamic Bayesian Network (DBN): models whether the user was satisfied after clicking.
These models are more realistic but harder to fit. In my experience, PBM works well enough for most applications and is much easier to implement. The more sophisticated models matter most when user behavior is genuinely sequential (like search results) rather than parallel (like a grid of recommendations).
Estimating position propensities without experiments:
If you can't run swap experiments, you can estimate propensities from observational data by assuming item relevance is independent of position (after controlling for the ranking model's score). This lets you fit position effects as a regression:
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fit position propensities from logged data.
# Assume: log-odds(click) = position effect + item relevance effect,
# i.e. logit(click) ~ position_features + item_features

def estimate_position_propensities(click_data):
    """
    Estimate relative position propensity scores from logged clicks.
    Returns log-odds coefficients (not absolute probabilities);
    exponentiate the coefficients for use with IPS loss weighting.
    Assumes item relevance is independent of position assignment.
    """
    # One-hot encode positions
    position_features = pd.get_dummies(click_data['position'], prefix='pos')
    # Include item features to control for relevance
    item_features = click_data[['predicted_score', 'item_popularity']]
    X = pd.concat([position_features, item_features], axis=1)
    y = click_data['clicked']
    model = LogisticRegression()
    model.fit(X, y)
    # Extract position coefficients
    return {col: model.coef_[0][i]
            for i, col in enumerate(position_features.columns)}
```
This is noisier than experimental estimation but often good enough in practice.
Position debiasing in training:
Once you have position propensities, you can debias during training by weighting losses:
```python
import torch
import torch.nn.functional as F

def position_debiased_loss(predictions, labels, positions, position_propensities):
    """
    Weight each example by inverse position propensity.
    Items in low-examination positions get upweighted.
    """
    weights = 1.0 / position_propensities[positions]
    # Optionally clip weights to reduce variance
    weights = torch.clamp(weights, max=10.0)
    per_example_loss = F.binary_cross_entropy(predictions, labels, reduction='none')
    return (per_example_loss * weights).mean()
```
Negative sampling strategies:
Most implicit feedback models need negative examples. Instead of treating all unobserved items as negatives (which conflates "not exposed" with "not preferred"), you can:

- Sample negatives only from items the user was actually exposed to
- Weight negatives by their exposure probability
- Fall back to popularity-weighted sampling when exposure logs are sparse
```python
import random

def exposure_aware_negative_sampling(user, positive_items, exposure_log, n_negatives):
    """
    Sample negatives from items the user was exposed to but didn't interact with.
    Falls back to popularity sampling for users with limited exposure data.
    (`popularity_weighted_sample` is assumed to be defined elsewhere.)
    """
    exposed_items = exposure_log.get(user, set())
    candidate_negatives = exposed_items - set(positive_items)
    if len(candidate_negatives) >= n_negatives:
        return random.sample(list(candidate_negatives), n_negatives)
    else:
        # Combine available candidates with fallback samples
        available = list(candidate_negatives)
        needed = n_negatives - len(available)
        return available + popularity_weighted_sample(needed)
```
Exposure-weighted losses:
If you know exposure probabilities, you can incorporate them directly into the loss:
$$\mathcal{L} = -\sum_{(u,i) \in \mathcal{D}^+} \log \sigma(\hat{s}_{u,i}) \;-\; \sum_{(u,i) \in \mathcal{D}^-} w_{u,i} \log\!\left(1 - \sigma(\hat{s}_{u,i})\right)$$

Here $\sigma$ is the sigmoid function and $\hat{s}_{u,i}$ is the model's predicted relevance score for user $u$ and item $i$. The weight $w_{u,i}$ upweights negatives that had high exposure probability—if a user saw an item but didn't interact, that's stronger negative signal than an item they never saw.
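A minimal numpy sketch of this idea. Setting the negative weight equal to the exposure probability is one reasonable instantiation, not the only one, and the numbers in the example are made up:

```python
import numpy as np

def exposure_weighted_bce(scores, labels, exposure_prob):
    """
    Binary cross-entropy where each negative is weighted by its exposure
    probability: a non-click on an item the user almost certainly saw is
    strong negative signal; a non-click on a rarely shown item is weak signal.
    """
    p = 1.0 / (1.0 + np.exp(-scores))  # sigmoid of predicted relevance
    weights = np.where(labels == 1, 1.0, exposure_prob)
    loss = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return (weights * loss).mean()

scores = np.array([2.0, -1.0, -1.0])
labels = np.array([1, 0, 0])
# Same non-click, very different exposure: the second negative counts 9x more
exposure = np.array([1.0, 0.9, 0.1])
print(exposure_weighted_bce(scores, labels, exposure))
```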
The idea: Ensure recommendations reflect the user's actual preference distribution, not a popularity-skewed version.
Steck (2018) introduced calibrated recommendations with an elegant formulation. If a user has watched 70% romance and 30% action movies, their recommendations should be approximately 70% romance and 30% action.
The objective balances relevance against calibration:

$$\max_{I} \;\; (1 - \lambda) \sum_{i \in I} s(i) \;-\; \lambda \, C_{\text{KL}}\!\left(p(\cdot \mid u),\, q(\cdot \mid I)\right)$$

Where:

- $I$ is the recommendation list being selected
- $s(i)$ is item $i$'s relevance score
- $p(\cdot \mid u)$ is the user's historical genre distribution
- $q(\cdot \mid I)$ is the genre distribution of the list $I$
- $\lambda \in [0, 1]$ trades off relevance against calibration
KL divergence measures how one probability distribution differs from another—here, how much the recommendation list's genre distribution differs from the user's historical preferences. A KL divergence of 0 means perfect calibration.
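The greedy reranker below relies on a `kl_divergence` helper. Here's a minimal version with Steck-style smoothing (blending $q$ toward $p$) so that genres missing from the list don't produce infinite divergence; the distributions in the usage example are made up:

```python
import numpy as np

def kl_divergence(p, q, alpha=0.01):
    """
    KL(p || q) over genre distributions, with smoothing: blend q toward p
    so genres absent from the list don't yield infinity.
    p, q: dicts mapping genre -> probability.
    """
    kl = 0.0
    for genre in set(p) | set(q):
        pg = p.get(genre, 0.0)
        qg = (1 - alpha) * q.get(genre, 0.0) + alpha * pg
        if pg > 0:
            kl += pg * np.log(pg / qg)
    return kl

user_hist = {"romance": 0.7, "action": 0.3}
print(kl_divergence(user_hist, user_hist))        # ~0: perfectly calibrated
print(kl_divergence(user_hist, {"action": 1.0}))  # large: no romance at all
```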
Greedy reranking:
Since optimizing this exactly is NP-hard, Steck proposed a greedy algorithm:
```python
def calibrated_rerank(candidates, user_genre_dist, lambda_param=0.5, k=10):
    """
    Greedy calibrated reranking.
    Start with an empty list, iteratively add the item that maximizes
    (1 - lambda) * relevance - lambda * calibration_loss.
    """
    result = []
    candidates = list(candidates)
    for _ in range(k):
        best_item = None
        best_score = float('-inf')
        for item in candidates:
            # Score the list as it would look with this item added
            new_list = result + [item]
            list_dist = compute_genre_distribution(new_list)
            relevance = item['score']
            cal_loss = kl_divergence(user_genre_dist, list_dist)
            score = (1 - lambda_param) * relevance - lambda_param * cal_loss
            if score > best_score:
                best_score = score
                best_item = item
        result.append(best_item)
        candidates.remove(best_item)
    return result
```
This greedy approach achieves a $(1 - 1/e)$ approximation guarantee—meaning the greedy solution is guaranteed to achieve at least about 63% of the optimal objective.
The core trade-off: Exploitation uses current knowledge to maximize immediate reward. Exploration gathers information to improve future recommendations.
Epsilon-greedy: With probability $\epsilon$, recommend a random item instead of the predicted best.
```python
import random

def epsilon_greedy_recommend(user, model, epsilon=0.1, all_items=None):
    """Simple epsilon-greedy exploration."""
    if random.random() < epsilon:
        return random.choice(all_items)
    else:
        return model.recommend(user)
```
Upper Confidence Bound (UCB): Recommend items with high uncertainty. Score = predicted relevance + exploration bonus:

$$\text{UCB}_i = \bar{x}_i + c \sqrt{\frac{\ln t}{n_i}}$$

Where $\bar{x}_i$ is item $i$'s estimated average reward, $t$ is the total number of recommendations made so far, $n_i$ is how many times item $i$ has been shown, and $c$ is an exploration constant (typically 1-2).
Thompson Sampling: Maintain a posterior distribution over item quality. Sample from the posterior and recommend the sampled best.
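Here's a minimal Thompson sampling sketch using Beta posteriors over per-item CTR (the click and impression counts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta posterior per item: Beta(clicks + 1, non-clicks + 1)
n_items = 3
clicks = np.array([40, 5, 0])
impressions = np.array([100, 10, 0])

def thompson_recommend():
    """Sample a plausible CTR for each item from its posterior; pick the max."""
    samples = rng.beta(clicks + 1, impressions - clicks + 1)
    return int(np.argmax(samples))

# Over many independent draws, well-proven items win most often, but the
# never-shown item (item 2, with a flat posterior) still gets traffic
picks = np.bincount([thompson_recommend() for _ in range(10_000)],
                    minlength=n_items)
print(picks)
```

Note how the never-shown item still gets a meaningful share of recommendations: a flat posterior produces wide samples, which is exactly the exploration behavior we want.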
The cost of exploration: Every exploratory recommendation potentially shows a user a suboptimal item. This has real cost. Production teams typically find that exploration must be balanced against long-term user satisfaction.
In my experience, most production systems use modest exploration rates (1-5%) combined with other diversity mechanisms.
Which of these techniques should you actually use? It depends on which biases affect you most.
Not all systems suffer from all biases equally. Here's a checklist to identify which ones affect you.
Metrics to Track
Track these regularly to monitor bias health:

- CTR by position (how steep is the examination decay?)
- Gini coefficient of recommendation frequency across items
- Catalog coverage (share of items recommended at least once in a window)
- Inter-user recommendation overlap (a homogenization signal)
Here are some rough thresholds I use to assess bias severity:
Position Bias:
Popularity Bias:
Catalog Coverage:
Feedback Loop Detection:
```python
import numpy as np

def compute_gini_coefficient(recommendation_counts):
    """
    Compute the Gini coefficient of recommendation frequency.
    0 = perfect equality (all items recommended equally)
    1 = perfect inequality (one item gets all recommendations)
    """
    sorted_counts = np.sort(recommendation_counts)
    n = len(sorted_counts)
    ranks = np.arange(1, n + 1)
    return (2 * np.sum(ranks * sorted_counts)) / (n * np.sum(sorted_counts)) - (n + 1) / n

# Example usage (get_recommendation_counts_by_item is your own data access)
item_rec_counts = get_recommendation_counts_by_item()
gini = compute_gini_coefficient(item_rec_counts)
print(f"Gini coefficient: {gini:.3f}")
# Healthy: < 0.6
# Concerning: 0.6 - 0.8
# Severe: > 0.8
```
Every debiasing technique involves trade-offs. Here's an honest assessment.
Propensity estimation is hard. Most methods assume you can estimate propensities accurately, but in practice you're estimating from the same biased data you're trying to correct.
Trade-offs between biases. Fixing one bias can exacerbate another. Increasing exploration to reduce feedback loops might increase exposure bias for specific user sessions.
Evaluation remains challenging. Even with debiasing, offline evaluation is approximate. The only true test is online A/B testing with business metrics.
Computational cost. Many debiasing methods add complexity. Propensity weighting changes your loss landscape. Calibrated reranking is an additional post-processing step.
Bias in recommender systems isn't a bug to be eliminated - it's a fundamental property of observational data that must be understood and managed.
The key takeaways:
Observational data is not experimental data. Your training signal reflects system behavior and user self-selection, not just true preferences.
Six biases matter most: selection, position, exposure, popularity, conformity, and feedback loops. Different systems have different severity profiles.
Offline metrics can mislead. A model that fits biased data well will score well on biased test data but may disappoint online.
IPS provides a principled framework. Inverse propensity scoring lets you reweight observations to approximate the true distribution, though variance is a concern.
Position bias is well-understood. The examination hypothesis and propensity estimation methods work well when you can run controlled experiments.
Feedback loops compound over time. Small biases amplify with retraining. Monitor diversity metrics and consider exploration.
Every mitigation involves trade-offs. There's no free lunch. Choose techniques based on your specific bias profile and constraints.
The next time your offline metrics look great but online performance disappoints, don't blame the model—examine the data. Somewhere in those millions of clicks is a systematic distortion between what you observed and what users actually want.
The six biases we covered aren't bugs to eliminate. They're properties of observational data to understand and manage. The tools exist: IPS, position debiasing, calibration, exploration. Use them thoughtfully, measure the trade-offs, and remember—the goal isn't a perfect model. It's a model that actually serves users better.
For those who want to go deeper, the works cited throughout this article are the natural starting points:

- Chen et al. (2020), "Bias and Debias in Recommender System: A Survey and Future Directions" — the survey quoted throughout
- Schnabel et al. (2016), "Recommendations as Treatments: Debiasing Learning and Evaluation" — the IPS framework
- Joachims et al. (2017), "Unbiased Learning-to-Rank with Biased Feedback" — position bias and swap experiments
- Steck (2018), "Calibrated Recommendations" — calibrated reranking
- Chaney et al. (2018), "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility" — feedback loops
- Abdollahpouri et al. (2024) on the impact of popularity bias