Your model trained. AUC (Area Under the ROC Curve) improved by 2%. You deployed it. Live metrics went sideways.
If you've built recommender systems, you've probably experienced this frustrating disconnect between offline and online performance. The culprit is often bias in your training data. Not "bias" in the ethical sense (though that matters too), but systematic distortions between what you observe and what users actually prefer.
This article covers the six major biases in recommender systems, why they matter, and what you can do about them. After reading, you'll be able to recognize which biases affect your system, understand how they distort training and evaluation, and choose mitigation techniques suited to your constraints.
Here's the core issue: user behavior data is observational, not experimental.
When a user clicks on an item, you observe it. When they don't click, you observe that too. But you have no idea why they didn't click. Did they see the item and dislike it? Did they never see it at all? Were they in a hurry and only looked at the top result?
As Chen et al. put it in their 2020 survey: "Blindly fitting the data without considering the inherent biases will result in many serious issues."
Here's the core problem: if you observe that users who clicked on "Star Wars" also clicked on "Marvel movies," you might conclude these users like blockbuster action. But maybe your recommendation system just kept showing them blockbusters. You observed a correlation between "user clicked Star Wars" and "user clicked Marvel," but you can't distinguish between:

- Preference: these users genuinely like blockbuster action.
- Exposure: the system mostly showed them blockbusters, so that's all they could click.
Without running an experiment where you randomly show different users different options, you can't tell which explanation is true. This is why we need debiasing—to recover signal from confounded data.
The problems compound when you realize that your training data was generated by a previous recommendation system. You're training on data that reflects past model behavior, not just user preferences. This creates circular feedback that can amplify errors over time.
In experimental data, you control what users see and measure their responses. In observational data, users self-select what they interact with, and you only see the outcome of that selection process.
In this context, bias means a systematic deviation between what you observe and what's actually true about user preferences.
The causal perspective helps here. Schnabel et al. (2016) framed it elegantly: "Exposing a user to an item in a recommendation system is an intervention analogous to exposing a patient to a treatment in a medical study."
Just like medical researchers can't simply observe who takes medication (sicker people might self-select), you can't simply observe who clicks on items (exposure, position, and social influence all confound the signal).
I find it useful to categorize biases by where they enter the system:

- Data biases (input): selection, exposure, position, and conformity bias distort the interactions you collect.
- Result biases (output): popularity bias distorts what the trained model recommends.
- Feedback loops connect the two: today's biased output becomes tomorrow's biased input.
The dangerous part is that biases in the output become biases in the input for the next training cycle. This feedback loop can cause small initial errors to compound geometrically.
With this framework in mind, let's examine each bias in detail.
The problem: Users choose what to interact with. Unobserved interactions are not random.
This is probably the most fundamental bias. When you see a rating, it's because a user chose to rate that item. Users tend to rate items they like, items they feel strongly about, or items they've actually encountered. The missing ratings are not "Missing Completely At Random" (MCAR) - they're "Missing Not At Random" (MNAR).
Let $R_{u,i}$ indicate whether user $u$ would truly like item $i$ if they saw it (their true preference), and let $O_{u,i}$ indicate whether we observe an interaction. Mathematically, we can express selection bias as:

$$P(O_{u,i} = 1 \mid R_{u,i}) \neq P(O_{u,i} = 1)$$

In other words, the probability of observing an interaction depends on whether the user would have liked the item.
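A ten-line simulation makes the distortion concrete. The observation probabilities below are made-up numbers chosen only to mimic "people mostly rate what they liked":

```python
import numpy as np

rng = np.random.default_rng(0)

# True preferences: ratings 1-5, uniform across 100k user-item pairs
true_ratings = rng.integers(1, 6, size=100_000)

# MNAR observation: the chance a rating is submitted rises with the rating
observe_prob = {1: 0.02, 2: 0.04, 3: 0.08, 4: 0.20, 5: 0.40}
p_obs = np.vectorize(observe_prob.get)(true_ratings)
observed_mask = rng.random(len(true_ratings)) < p_obs

print(f"True mean rating:     {true_ratings.mean():.2f}")
print(f"Observed mean rating: {true_ratings[observed_mask].mean():.2f}")
```

The observed mean lands well above 4 even though the true mean is 3 — the dataset looks far more positive than the underlying preferences are.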
Why it matters: If you train a model treating unobserved interactions as negatives (which many implicit feedback models do), you're assuming users rejected items they may never have seen.
Common Mistake
Many practitioners treat all unobserved user-item pairs as negative examples. This conflates "didn't interact" with "wouldn't like if seen." The two are very different.
Real-world evidence: Research on explicit rating data consistently shows that rating distributions skew positive (Chen et al., 2020). On Netflix, 75% of ratings are 4 stars or higher. Not because 75% of movies are good—but because people rate movies they liked. The missing ratings (the movies users started and stopped after 5 minutes) tell a very different story that never makes it into your training data.
The problem: Items shown higher get more clicks regardless of quality.
This one is well-documented and substantial. Joachims et al. (2017) showed that "the order of presentation has a strong influence on where users click." Users trust the first few results and often don't examine lower positions at all.
The standard model decomposes clicks into two independent factors—examination and relevance:

$$P(\text{click} \mid u, i, k) = P(\text{examined} \mid k) \times P(\text{relevant} \mid u, i)$$

where $k$ denotes the position.
This is called the examination hypothesis. A user must first examine an item, then decide whether to click based on its relevance.
How bad is it? In my experience with search and recommendation systems, CTR (click-through rate) typically drops 30-50% between position 1 and position 2, and continues declining roughly logarithmically. Position 10 might get 10x fewer clicks than position 1, even for equally relevant items.
```python
# Rough model of position bias:
# examination probability decays with position
def examination_prob(position, decay=0.8):
    """P(examine | position) modeled here with exponential/geometric decay."""
    return decay ** (position - 1)

# Example: positions 1-5
for pos in range(1, 6):
    print(f"Position {pos}: {examination_prob(pos):.2%} examination rate")

# Position 1: 100.00% examination rate
# Position 2: 80.00% examination rate
# Position 3: 64.00% examination rate
# Position 4: 51.20% examination rate
# Position 5: 40.96% examination rate
```
Why it matters: If you train on click data without accounting for position, you learn "items in position 1 are better" rather than "these items are genuinely preferred."
The problem: Users can only interact with items they're exposed to.
This is closely related to selection bias, but the cause is different. Selection bias comes from user choice; exposure bias comes from system choice. Your previous recommendation model decided what to show. Items that weren't shown couldn't be clicked, no matter how relevant they were.
Two flavors:

- Zero exposure: items the system never showed generate no signal at all, positive or negative.
- Unequal exposure: among shown items, exposure frequency varies enormously, so feedback volume reflects the exposure policy as much as item quality.
The consequence is that observed interactions are always a biased subset of potential interactions. Items with high historical exposure have more opportunities to receive feedback, creating a data imbalance.
Impact on catalog coverage: At scale, exposure bias often means the majority of interactions come from a small fraction of the catalog—power-law distributions (where a small number of items receive the vast majority of interactions) are typical. The "long tail" of items gets almost no signal, making it impossible to learn whether they're good or not.
The problem: Popular items get recommended more, which makes them more popular.
This is a result bias (it affects recommendations, not just data collection), but it creates feedback loops that amplify over time. The dynamic works like this:
Abdollahpouri et al. (2024) define popularity bias with an impact-oriented framing: "A recommender system faces issues of popularity bias when the recommendations focus on popular items to the extent that they limit the value of the system or create harm."
The key insight is that recommending popular items isn't inherently bad. Popular items are often genuinely good. The problem is when popularity overwhelms other signals, reducing personalization and suppressing potentially relevant niche items.
The long-tail distribution: Most recommendation datasets follow a power-law distribution where a small "head" of items receives the vast majority of interactions. Training on this data produces models that perpetuate and amplify this imbalance.
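To see how extreme the head/tail split gets, here's a quick sketch assuming a hypothetical Zipf-like (1/rank) popularity curve over a 10,000-item catalog:

```python
import numpy as np

n_items = 10_000

# Zipf-like popularity: interaction share proportional to 1 / rank
ranks = np.arange(1, n_items + 1)
interactions = 1.0 / ranks
interactions /= interactions.sum()

# Share of all interactions captured by the top 1% of items
top_1pct = int(n_items * 0.01)
head_share = interactions[:top_1pct].sum()
print(f"Top 1% of items capture {head_share:.0%} of interactions")
```

Under this curve, the top 1% of items capture roughly half of all interactions — before any recommender amplifies the imbalance further.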
The problem: Users modify their behavior based on others' opinions.
When users see that an item has 4.8 stars, they rate it higher than they would have otherwise. When they see a video has 10 million views, they're more likely to click and more likely to enjoy it (or at least report enjoying it).
This means "feedback does not always signify user true preference" (Chen et al., 2020). You're observing socially-influenced behavior, not pure preference.
Manifestations:

- Rating anchoring: displayed average ratings pull individual ratings toward them.
- Social proof in clicks: view counts and popularity badges increase click-through independent of content quality.
- Herding: early feedback disproportionately shapes later feedback.
Why it's tricky: Conformity bias is partially legitimate signal. If everyone likes something, it probably is good. But it's also partially noise that reduces the diversity of feedback you receive.
The problem: Your model trains on its own biased outputs.
This is the meta-bias that amplifies all the others. Every time you retrain on user interactions that were shaped by previous recommendations, you compound existing biases.
Chaney et al. (2018) demonstrated this with careful simulations. They set up a community of 100 users interacting with items over 1,000 time intervals, testing different recommendation algorithms. Their key finding: "Recommendation systems that include repeated training homogenize user behavior more than is needed for ideal utility."
The simulation results show clear patterns:
The homogenization effect: Users exposed to similar recommendations start behaving more similarly. This reduces diversity in training data, which reduces diversity in recommendations, which further reduces behavioral diversity. It's a convergence spiral.
Who gets hurt? Users with minority preferences suffer disproportionately. As Chaney et al. observed: "Users who experience losses in utility generally have higher homogenization with their nearest neighbor." If your preferences don't match the majority, recommendations push you toward mainstream items, and your true preferences get less signal in the training data.
The algorithmic confounding problem: It becomes difficult to tell which user interactions stem from true preferences and which are influenced by recommendations. Your training data is contaminated by your own model's behavior, making it hard to learn anything genuinely new about user preferences.
The Compounding Problem
Feedback loops don't just maintain bias - they amplify it. Small initial biases can compound significantly over several retraining cycles. This is why periodic re-evaluation with fresh data is essential.
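A deliberately tiny simulation shows the lock-in mechanism. All the numbers below are made up: two items share the same true CTR, the only difference is one extra historical click, and a greedy policy always shows the current leader:

```python
import numpy as np

# Two items with identical true CTR (10%); item A starts one click ahead
clicks = np.array([11.0, 10.0])

for cycle in range(10):
    # Greedy "retraining": all 100 impressions go to the historical winner
    winner = int(np.argmax(clicks))
    clicks[winner] += 100 * 0.1  # same true click-through rate for both

share_a = clicks[0] / clicks.sum()
print(f"Final clicks: {clicks}, item A share: {share_a:.1%}")
```

Item A ends up with over 90% of cumulative clicks purely because of a one-click head start — the other item never gets the exposure needed to prove itself.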
Understanding the six biases is the first step. The next question is: where do they actually hurt?
Understanding where biases hurt helps prioritize which ones to address. The impact varies across the ML lifecycle.
When your model fits biased data, it learns spurious correlations instead of causal relationships.
Example: If items in position 1 always get clicked, the model might learn "position 1 items are good" rather than "these specific items are good." If popular items dominate the training set, the model learns "recommend popular items" rather than learning to personalize.
The fundamental issue is distribution mismatch. You're training on $P_{\text{logged}}(x, y)$ but you want to predict well under $P_{\text{true}}(x, y)$, where $x$ represents user and item features and $y$ represents the outcome (click, rating, purchase, etc.). These distributions differ systematically.
Concrete example: Consider a music recommendation system. Users who listen to Taylor Swift also tend to click on other pop artists, but that's partly because the system keeps showing them pop. If you train on this data naively, you'll learn "Taylor Swift fans like pop" when the truth might be "Taylor Swift fans would enjoy indie folk if they ever heard it."
The model becomes a self-fulfilling prophecy machine, not a preference discovery system.
This is where many teams get burned. Your offline evaluation uses a test set drawn from the same biased distribution as your training data. A model that learns biases well will score well on this test set - it's predicting the biased outcomes accurately!
But when you deploy, you're serving recommendations to the true distribution of user preferences. The mismatch shows up as: "We improved AUC by 2% but engagement dropped."
As Chen et al. note: "The discrepancy between offline evaluation and online metrics hurts user satisfaction and trust."
Why this happens:

- The test set inherits the same exposure and position biases as the training set, because both come from the same logging policy.
- Popular and highly-exposed items are overrepresented in both splits, so a model that predicts "popular" scores well offline.
- Genuinely novel recommendations have little or no test data to validate against.
A model that learns these biases fits the test distribution well. But that's not the distribution you care about serving.
Even if your model is better than the current production system, estimating how much better is hard. This is the counterfactual evaluation problem: you can't directly observe what would have happened under a different policy.
The "winner's curse" compounds this. If you test multiple models and select the one with the highest estimated improvement, you're likely overestimating its true improvement due to evaluation variance.
Policy mismatch in practice: Your training data was generated by policy A (the old model). You're evaluating policy B (the new model). But the outcomes you observe for policy B's recommendations were never collected under policy A. You're extrapolating into regions of the feature space with no data support.
This is why counterfactual evaluation methods (IPS, doubly robust estimators) exist - to estimate what would have happened under a different policy using data collected under the current policy.
Over time, feedback loops cause:

- Shrinking effective catalog: recommendations concentrate on a narrowing head of items.
- Homogenized users: behavioral diversity declines as everyone sees similar recommendations.
- Calcified preferences: the system stops learning anything new about what users might like.
These are slow-moving problems that won't show up in two-week A/B tests but erode system health over months and years. By the time you notice, the damage is done.
Now that we understand the problem, let's look at solutions.
The intuition: Reweight samples to "undo" the biased collection process.
If you know the probability that each observation would be collected (the propensity), you can upweight unlikely observations and downweight likely ones. This creates an unbiased estimate of the true distribution.
Think of propensity as answering: "How likely was it that we would have observed this particular interaction?" If an item only appears 10% of the time, observations of it are "surprising" and should count more heavily.
For a given user-item pair $(u, i)$ with observed outcome $\delta_{u,i}$, the propensity is the probability that the pair was observed at all:

$$p_{u,i} = P(O_{u,i} = 1)$$
The IPS estimator uses these propensities to correct for biased sampling. Let $\delta_{u,i}$ represent any evaluation metric (e.g., squared error for rating prediction, binary cross-entropy for clicks), and $N$ be the total number of possible user-item pairs:

$$\hat{R}_{\text{IPS}} = \frac{1}{N} \sum_{(u,i)\,:\,O_{u,i}=1} \frac{\delta_{u,i}}{p_{u,i}}$$
Schnabel et al. (2016) showed this estimator is unbiased when propensities are correctly specified.
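Here's a small self-contained check of that unbiasedness claim, on synthetic data where likeable pairs are logged far more often than disliked ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# True outcome for every user-item pair (what we'd like the average of)
true_outcome = rng.random(n)

# Biased logging: pairs the user would like are far more likely to be logged
propensity = 0.05 + 0.9 * true_outcome
observed = rng.random(n) < propensity

naive = true_outcome[observed].mean()
ips = (true_outcome[observed] / propensity[observed]).sum() / n

print(f"True mean:  {true_outcome.mean():.3f}")
print(f"Naive mean: {naive:.3f}")   # overshoots: observed pairs skew positive
print(f"IPS mean:   {ips:.3f}")     # close to the true mean again
```

The naive average over logged pairs overshoots badly; the IPS-weighted average recovers the true mean because each observation is divided by how likely it was to be logged.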
The variance problem: IPS has a major practical issue. When propensities are small (rare observations), the weights become huge, leading to high variance estimates.
Solutions:

- Clipping: cap the maximum importance weight.
- Self-normalization (SNIPS): divide by the sum of weights instead of $N$.
```python
import numpy as np

# Clipped IPS - cap the maximum weight
def clipped_ips_weight(propensity, max_weight=10.0):
    """Clip IPS weights to reduce variance."""
    return min(1.0 / propensity, max_weight)

# Self-Normalized IPS (SNIPS) - normalize by sum of weights
def snips_estimator(propensities, outcomes):
    """
    SNIPS normalizes by the sum of importance weights,
    providing lower variance at the cost of slight bias.
    """
    weights = 1.0 / np.asarray(propensities)
    weighted_outcomes = weights * np.asarray(outcomes)
    return weighted_outcomes.sum() / weights.sum()
```
SNIPS (Self-Normalized IPS) is often preferred in practice because it's more stable, though it introduces slight bias.
Practical Advice
You rarely know true propensities. Estimate them from your logging data - impression counts are often a reasonable proxy for exposure probability. Even rough estimates help significantly compared to ignoring bias entirely.
Doubly Robust Methods:
What if your propensity estimates are wrong? Doubly robust estimators combine IPS with an imputation model to be more robust to misspecification.
The idea is to use two components:

- A propensity model that estimates $p_{u,i}$, the probability each pair was observed.
- An imputation model that predicts the metric value $\hat{\delta}_{u,i}$ for every user-item pair, observed or not.
The estimator uses the imputation as a baseline and applies an IPS correction where we have actual observations:

$$\hat{R}_{\text{DR}} = \frac{1}{N} \sum_{u,i} \hat{\delta}_{u,i} \;+\; \frac{1}{N} \sum_{(u,i)\,:\,O_{u,i}=1} \frac{\delta_{u,i} - \hat{\delta}_{u,i}}{p_{u,i}}$$
The key property: this estimator is unbiased if either the propensity model or the imputation model is correct. You get two chances to be right.
In practice, doubly robust methods often outperform pure IPS, especially when propensities are estimated noisily. The imputation model provides a reasonable baseline, and IPS corrects where you have data.
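A quick synthetic check of the "two chances to be right" property. Here the imputation model is deliberately biased upward, and the IPS correction (with correct propensities) pulls the estimate back:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

true_outcome = rng.random(n)                # what we want the mean of
propensity = 0.05 + 0.9 * true_outcome      # biased logging, as before
observed = rng.random(n) < propensity

# A deliberately *biased* imputation model (systematically too optimistic)
imputed = 0.8 * true_outcome + 0.2

# Doubly robust: imputation baseline + IPS correction on observed pairs
dr = imputed.mean() + (
    (true_outcome[observed] - imputed[observed]) / propensity[observed]
).sum() / n

print(f"True mean:       {true_outcome.mean():.3f}")
print(f"Imputation only: {imputed.mean():.3f}")   # biased high
print(f"DR estimate:     {dr:.3f}")               # corrected
```

By symmetry, the estimator also recovers the truth when the imputation is exactly right but the propensities are wrong (the correction terms vanish) — that's the doubly robust guarantee.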
Propensity estimation for position bias:
The cleanest approach is controlled experiments. Joachims et al. (2017) proposed a minimal intervention: swap documents between positions and measure CTR changes.
To estimate the ratio of examination probabilities between positions $k$ and $k'$, run a randomization experiment that swaps items between those positions, then measure the click-through rates each position receives:

$$\frac{P(\text{examined} \mid k)}{P(\text{examined} \mid k')} \approx \frac{\text{CTR}_k}{\text{CTR}_{k'}}$$

Because the swap makes the item distribution identical at both positions, any remaining CTR difference is attributable to examination alone.
This reveals relative examination probabilities without destroying user experience.
Click models:
Alternatively, you can fit a click model to logged data. The Position-Based Model (PBM) assumes clicks factor into a position-only examination term and an item-only relevance term:

$$P(\text{click} \mid u, i, k) = P(\text{examined} \mid k) \times P(\text{relevant} \mid u, i)$$
More sophisticated models account for user behavior dynamics:

- Cascade model: users scan top-down and stop at the first click.
- User Browsing Model (UBM): examination depends on the position of the previous click.
- Dynamic Bayesian Network (DBN): models whether the user was satisfied after clicking.
These models are more realistic but harder to fit. In my experience, PBM works well enough for most applications and is much easier to implement. The more sophisticated models matter most when user behavior is genuinely sequential (like search results) rather than parallel (like a grid of recommendations).
Estimating position propensities without experiments:
If you can't run swap experiments, you can estimate propensities from observational data by assuming item relevance is independent of position (after controlling for the ranking model's score). This lets you fit position effects as a regression:
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fit position propensities from logged data.
# Assume: log-odds(click) = position effect + item relevance effect,
# i.e. logit(click) ~ position_features + item_features

def estimate_position_propensities(click_data):
    """
    Estimate relative position propensity scores from logged clicks.
    Returns log-odds coefficients (not absolute probabilities);
    exponentiate the coefficients for use with IPS loss weighting.
    Assumes item relevance is independent of position assignment.
    """
    # One-hot encode positions
    position_features = pd.get_dummies(click_data['position'], prefix='pos')
    # Include item features to control for relevance
    item_features = click_data[['predicted_score', 'item_popularity']]
    X = pd.concat([position_features, item_features], axis=1)
    y = click_data['clicked']
    model = LogisticRegression()
    model.fit(X, y)
    # Extract position coefficients
    return {col: model.coef_[0][i]
            for i, col in enumerate(position_features.columns)}
```
This is noisier than experimental estimation but often good enough in practice.
Position debiasing in training:
Once you have position propensities, you can debias during training by weighting losses:
```python
import torch
import torch.nn.functional as F

def position_debiased_loss(predictions, labels, positions, position_propensities):
    """
    Weight each example by inverse position propensity.
    Items in low-examination positions get upweighted.
    """
    weights = 1.0 / position_propensities[positions]
    # Optionally clip weights to reduce variance
    weights = torch.clamp(weights, max=10.0)
    per_example_loss = F.binary_cross_entropy(predictions, labels, reduction='none')
    return (per_example_loss * weights).mean()
```
Negative sampling strategies:
Most implicit feedback models need negative examples. Instead of treating all unobserved items as negatives (which conflates "not exposed" with "not preferred"), you can:

- Sample negatives only from items the user was actually exposed to
- Weight negatives by their exposure probability
- Fall back to popularity-weighted sampling when exposure logs are sparse
```python
import random

def exposure_aware_negative_sampling(user, positive_items, exposure_log, n_negatives):
    """
    Sample negatives from items the user was exposed to but didn't interact with.
    Falls back to popularity sampling for users with limited exposure data.
    (`popularity_weighted_sample` is assumed to be defined elsewhere.)
    """
    exposed_items = exposure_log.get(user, set())
    candidate_negatives = exposed_items - set(positive_items)
    if len(candidate_negatives) >= n_negatives:
        return random.sample(list(candidate_negatives), n_negatives)
    else:
        # Combine available candidates with fallback samples
        available = list(candidate_negatives)
        needed = n_negatives - len(available)
        return available + popularity_weighted_sample(needed)
```
Exposure-weighted losses:
If you know exposure probabilities, you can incorporate them directly into the loss:
$$\mathcal{L} = -\sum_{(u,i) \in \mathcal{D}^+} \log \sigma(\hat{s}_{u,i}) \;-\; \sum_{(u,i) \in \mathcal{D}^-} w_{u,i} \log\!\left(1 - \sigma(\hat{s}_{u,i})\right)$$

Here $\sigma$ is the sigmoid function and $\hat{s}_{u,i}$ is the model's predicted relevance score for user $u$ and item $i$. The weight $w_{u,i}$ upweights negatives that had high exposure probability—if a user saw an item but didn't interact, that's stronger negative signal than an item they never saw.
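A minimal numpy sketch of this idea. Setting the negative weight equal to the exposure probability is one reasonable instantiation, not the only one, and the numbers in the example are made up:

```python
import numpy as np

def exposure_weighted_bce(scores, labels, exposure_prob):
    """
    Binary cross-entropy where each negative is weighted by its exposure
    probability: a non-click on an item the user almost certainly saw is
    strong negative signal; a non-click on a rarely shown item is weak signal.
    """
    p = 1.0 / (1.0 + np.exp(-scores))  # sigmoid of predicted relevance
    weights = np.where(labels == 1, 1.0, exposure_prob)
    loss = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return (weights * loss).mean()

scores = np.array([2.0, -1.0, -1.0])
labels = np.array([1, 0, 0])
# Same non-click, very different exposure: the second negative counts 9x more
exposure = np.array([1.0, 0.9, 0.1])
print(exposure_weighted_bce(scores, labels, exposure))
```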
The idea: Ensure recommendations reflect the user's actual preference distribution, not a popularity-skewed version.
Steck (2018) introduced calibrated recommendations with an elegant formulation. If a user has watched 70% romance and 30% action movies, their recommendations should be approximately 70% romance and 30% action.
The objective balances relevance against calibration:

$$\max_{I} \;\; (1 - \lambda) \sum_{i \in I} s(i) \;-\; \lambda \, C_{\text{KL}}\!\left(p(\cdot \mid u),\, q(\cdot \mid I)\right)$$

Where:

- $I$ is the recommendation list being selected
- $s(i)$ is item $i$'s relevance score
- $p(\cdot \mid u)$ is the user's historical genre distribution
- $q(\cdot \mid I)$ is the genre distribution of the list $I$
- $\lambda \in [0, 1]$ trades off relevance against calibration
KL divergence measures how one probability distribution differs from another—here, how much the recommendation list's genre distribution differs from the user's historical preferences. A KL divergence of 0 means perfect calibration.
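The greedy reranker below relies on a `kl_divergence` helper. Here's a minimal version with Steck-style smoothing (blending $q$ toward $p$) so that genres missing from the list don't produce infinite divergence; the distributions in the usage example are made up:

```python
import numpy as np

def kl_divergence(p, q, alpha=0.01):
    """
    KL(p || q) over genre distributions, with smoothing: blend q toward p
    so genres absent from the list don't yield infinity.
    p, q: dicts mapping genre -> probability.
    """
    kl = 0.0
    for genre in set(p) | set(q):
        pg = p.get(genre, 0.0)
        qg = (1 - alpha) * q.get(genre, 0.0) + alpha * pg
        if pg > 0:
            kl += pg * np.log(pg / qg)
    return kl

user_hist = {"romance": 0.7, "action": 0.3}
print(kl_divergence(user_hist, user_hist))        # ~0: perfectly calibrated
print(kl_divergence(user_hist, {"action": 1.0}))  # large: no romance at all
```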
Greedy reranking:
Since optimizing this exactly is NP-hard, Steck proposed a greedy algorithm:
```python
def calibrated_rerank(candidates, user_genre_dist, lambda_param=0.5, k=10):
    """
    Greedy calibrated reranking.
    Start with an empty list, iteratively add the item that maximizes
    (1 - lambda) * relevance - lambda * calibration_loss.
    """
    result = []
    candidates = list(candidates)
    for _ in range(k):
        best_item = None
        best_score = float('-inf')
        for item in candidates:
            # Score the list as it would look with this item added
            new_list = result + [item]
            list_dist = compute_genre_distribution(new_list)
            relevance = item['score']
            cal_loss = kl_divergence(user_genre_dist, list_dist)
            score = (1 - lambda_param) * relevance - lambda_param * cal_loss
            if score > best_score:
                best_score = score
                best_item = item
        result.append(best_item)
        candidates.remove(best_item)
    return result
```
This greedy approach achieves a $(1 - 1/e)$ approximation guarantee—meaning the greedy solution is guaranteed to achieve at least about 63% of the optimal objective.
The core trade-off: Exploitation uses current knowledge to maximize immediate reward. Exploration gathers information to improve future recommendations.
Epsilon-greedy: With probability $\epsilon$, recommend a random item instead of the predicted best.
```python
import random

def epsilon_greedy_recommend(user, model, epsilon=0.1, all_items=None):
    """Simple epsilon-greedy exploration."""
    if random.random() < epsilon:
        return random.choice(all_items)
    else:
        return model.recommend(user)
```
Upper Confidence Bound (UCB): Recommend items with high uncertainty. Score = predicted relevance + exploration bonus:

$$\text{UCB}_i = \bar{x}_i + c \sqrt{\frac{\ln t}{n_i}}$$

Where $\bar{x}_i$ is item $i$'s estimated average reward, $t$ is the total number of recommendations made so far, $n_i$ is how many times item $i$ has been shown, and $c$ is an exploration constant (typically 1-2).
Thompson Sampling: Maintain a posterior distribution over item quality. Sample from the posterior and recommend the sampled best.
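Here's a minimal Thompson sampling sketch using Beta posteriors over per-item CTR (the click and impression counts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta posterior per item: Beta(clicks + 1, non-clicks + 1)
n_items = 3
clicks = np.array([40, 5, 0])
impressions = np.array([100, 10, 0])

def thompson_recommend():
    """Sample a plausible CTR for each item from its posterior; pick the max."""
    samples = rng.beta(clicks + 1, impressions - clicks + 1)
    return int(np.argmax(samples))

# Over many independent draws, well-proven items win most often, but the
# never-shown item (item 2, with a flat posterior) still gets traffic
picks = np.bincount([thompson_recommend() for _ in range(10_000)],
                    minlength=n_items)
print(picks)
```

Note how the never-shown item still gets a meaningful share of recommendations: a flat posterior produces wide samples, which is exactly the exploration behavior we want.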
The cost of exploration: Every exploratory recommendation potentially shows a user a suboptimal item. This has real cost. Production teams typically find that exploration must be balanced against long-term user satisfaction.
In my experience, most production systems use modest exploration rates (1-5%) combined with other diversity mechanisms.
Which of these techniques should you actually use? It depends on which biases affect you most.
Not all systems suffer from all biases equally. Here's a checklist to identify which ones affect you.
Metrics to Track
Track these regularly to monitor bias health:

- CTR by position (how steep is the examination decay?)
- Gini coefficient of recommendation frequency across items
- Catalog coverage (share of items recommended at least once in a window)
- Inter-user recommendation overlap (a homogenization signal)
Here are some rough thresholds I use to assess bias severity:
Position Bias:
Popularity Bias:
Catalog Coverage:
Feedback Loop Detection:
```python
import numpy as np

def compute_gini_coefficient(recommendation_counts):
    """
    Compute the Gini coefficient of recommendation frequency.
    0 = perfect equality (all items recommended equally)
    1 = perfect inequality (one item gets all recommendations)
    """
    sorted_counts = np.sort(recommendation_counts)
    n = len(sorted_counts)
    ranks = np.arange(1, n + 1)
    return (2 * np.sum(ranks * sorted_counts)) / (n * np.sum(sorted_counts)) - (n + 1) / n

# Example usage (get_recommendation_counts_by_item is your own data access)
item_rec_counts = get_recommendation_counts_by_item()
gini = compute_gini_coefficient(item_rec_counts)
print(f"Gini coefficient: {gini:.3f}")
# Healthy: < 0.6
# Concerning: 0.6 - 0.8
# Severe: > 0.8
```
Every debiasing technique involves trade-offs. Here's an honest assessment.
Propensity estimation is hard. Most methods assume you can estimate propensities accurately, but in practice you're estimating from the same biased data you're trying to correct.
Trade-offs between biases. Fixing one bias can exacerbate another. Increasing exploration to reduce feedback loops might increase exposure bias for specific user sessions.
Evaluation remains challenging. Even with debiasing, offline evaluation is approximate. The only true test is online A/B testing with business metrics.
Computational cost. Many debiasing methods add complexity. Propensity weighting changes your loss landscape. Calibrated reranking is an additional post-processing step.
Bias in recommender systems isn't a bug to be eliminated - it's a fundamental property of observational data that must be understood and managed.
The key takeaways:
Observational data is not experimental data. Your training signal reflects system behavior and user self-selection, not just true preferences.
Six biases matter most: selection, position, exposure, popularity, conformity, and feedback loops. Different systems have different severity profiles.
Offline metrics can mislead. A model that fits biased data well will score well on biased test data but may disappoint online.
IPS provides a principled framework. Inverse propensity scoring lets you reweight observations to approximate the true distribution, though variance is a concern.
Position bias is well-understood. The examination hypothesis and propensity estimation methods work well when you can run controlled experiments.
Feedback loops compound over time. Small biases amplify with retraining. Monitor diversity metrics and consider exploration.
Every mitigation involves trade-offs. There's no free lunch. Choose techniques based on your specific bias profile and constraints.
The next time your offline metrics look great but online performance disappoints, don't blame the model—examine the data. Somewhere in those millions of clicks is a systematic distortion between what you observed and what users actually want.
The six biases we covered aren't bugs to eliminate. They're properties of observational data to understand and manage. The tools exist: IPS, position debiasing, calibration, exploration. Use them thoughtfully, measure the trade-offs, and remember—the goal isn't a perfect model. It's a model that actually serves users better.
For those who want to go deeper, the works cited throughout this article are the natural starting points:

- Chen et al. (2020), "Bias and Debias in Recommender System: A Survey and Future Directions" — the survey quoted throughout
- Schnabel et al. (2016), "Recommendations as Treatments: Debiasing Learning and Evaluation" — the IPS framework
- Joachims et al. (2017), "Unbiased Learning-to-Rank with Biased Feedback" — position bias and swap experiments
- Steck (2018), "Calibrated Recommendations" — calibrated reranking
- Chaney et al. (2018), "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility" — feedback loops
- Abdollahpouri et al. (2024) on the impact of popularity bias