RL for Recommender Systems, Part 4: Implementation Guide
You've read three articles of theory: policy gradients, off-policy correction, retention optimization, cold-start solutions. If you're like me, you're now asking: "Okay, but where do I start?"
This article answers that question.
By the end of this article, you'll be able to:
- Understand Pinterest's two 2025 approaches: multi-objective tuning (DRL-PUT) and simulator-based RL (RecoMind)
- Choose the right RL approach for your specific recommendation problem
- Follow a practical implementation roadmap from bandits to full RL
- Avoid common pitfalls that trip up RL practitioners
We'll cover Pinterest's two 2025 papers (which represent what a production RL system looks like today), then give you a decision framework and implementation roadmap you can adapt to your own system.
Pinterest's RL Systems (2025)
Pinterest published two complementary papers in 2025, each tackling a different aspect of RL for recommendations.
DRL-PUT: Multi-Objective Utility Tuning
Paper: Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest
Every ad ranking system has a utility function -- a formula that combines predicted click rates, conversion rates, and bid prices into a single score. Whoever sets the weights in that formula controls the balance between revenue and user experience.
At most companies, a cross-functional team manually tunes those weights. Product says "engagement matters more this quarter," someone bumps w_click from 1.0 to 1.5, and everyone watches the dashboards. This works, but the Pinterest team identified three problems:
- The search space is combinatorial. With a dozen-plus weights, the number of possible combinations is enormous. No team can explore this manually.
- Static weights ignore context. A user who clicks on everything should probably see different ad weighting than a user who rarely clicks but occasionally converts. The same applies to time-of-day and seasonality (Black Friday vs. a normal Tuesday).
- The tuning objective is vague. "Make engagement better without hurting revenue too much" is not a loss function.
DRL-PUT replaces manual tuning with an RL agent that predicts the optimal weights per ad request. The agent doesn't modify the ranking models themselves -- it sits on top of the existing pipeline and adjusts the coefficients that combine model outputs into a final score.
The Ranking Utility Formula
The ranking utility for a given ad takes this form:

$$U(\text{ad}) = \Big(p_a \cdot b + \sum_{e} w_e \cdot p_e\Big) \cdot \mathbb{1}\big[p_a \cdot b \geq r\big]$$

where:
- $p_a$ is the predicted probability of the action the ad campaign optimizes for (click, conversion, or impression, depending on campaign type -- more on campaign types below)
- $b$ is the advertiser's bid price for that action
- $p_e$ is the predicted probability of engagement type $e$: click, click30 (user spends more than 30 seconds on the ad's landing page, indicating genuine interest rather than an accidental tap), or conversion
- $w_e$ are the weights DRL-PUT learns to tune per-user
- $r$ is a reserve price threshold -- ads with estimated revenue below $r$ get filtered out entirely
- $\mathbb{1}[\cdot]$ is an indicator function returning 1 if the condition is met, 0 otherwise
Inside the parentheses, the first component -- $p_a \cdot b$ -- captures revenue. The summation captures user engagement value. DRL-PUT learns both the engagement weights $w_e$ and the reserve price $r$ dynamically based on who the user is and what context they're in.
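To make this concrete, here's a tiny Python sketch of the utility computation as described above. The function and argument names are illustrative assumptions, not code from the paper.

# Minimal sketch of the ranking utility described above; names are illustrative, not Pinterest's code.
def ranking_utility(p_paid_action, bid, engagement_probs, weights, reserve_price):
    """Revenue term plus weighted engagement terms, gated by the reserve-price filter."""
    revenue = p_paid_action * bid
    if revenue < reserve_price:
        return 0.0  # indicator fails: the ad is filtered out before ranking
    engagement = sum(weights[e] * engagement_probs[e] for e in ("click", "click30", "conversion"))
    return revenue + engagement

The weights dictionary here corresponds to the $w_e$ coefficients that DRL-PUT tunes per request.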
Why REINFORCE? (And Why Actor-Critic Failed)
If you've read Part 2, you know the two main families of policy gradient methods: REINFORCE and Actor-Critic. Pinterest tried both. Actor-Critic failed.
The reason is specific to ad recommender systems: estimating a value function is "prohibitively difficult" due to "high variance and unbalanced distribution of immediate reward for each state." In plain terms, ad rewards are extremely skewed -- most ad impressions generate zero revenue, a few generate a lot. The critic can't learn a stable baseline from this distribution.
The paper tested 6 configurations in offline experiments. The table below previews the action space designs we'll explain in detail shortly -- for now, focus on the Pass/Fail pattern:
| Config | Action Space | Exploration | Algorithm | Diversity | Relative Gain |
|---|---|---|---|---|---|
| 1 | Continuous | Gaussian | Actor-Critic | Fail | Fail |
| 2 | Continuous | Uniform | Actor-Critic | Fail | Fail |
| 3 | Discretized | Gaussian | Actor-Critic | Fail | Fail |
| 4 | Discretized | Uniform | Actor-Critic | Fail | Fail |
| 5 | Grouped | Uniform | Actor-Critic | Fail | Fail |
| 6 | Grouped | Uniform | REINFORCE | Pass | Pass |
Only one configuration worked: grouped discrete actions + uniform exploration + REINFORCE. Every Actor-Critic variant failed on both offline metrics.
Tip
This is a useful result even if you never build this system. When your RL problem has extremely skewed rewards (common in ads and e-commerce), any method that requires estimating a value function -- including Actor-Critic -- can struggle because the critic sees mostly-zero rewards with occasional outliers. Pure policy gradient methods like REINFORCE that learn directly from rewards without a value estimate may be more robust.
DRL-PUT uses a modified, single-step version of REINFORCE with discount factor $\gamma = 0$. That makes it a completely myopic agent -- it only optimizes for the immediate reward of each ad request, not a trajectory of future interactions.
This might seem to contradict the series' premise that myopic optimization fails. The difference: in Part 1, the problem was being myopic about what to optimize (clicks instead of long-term engagement). DRL-PUT is myopic about when to optimize (this request only), but it optimizes a multi-objective reward that captures engagement, revenue, and user value simultaneously. For ad ranking, choosing the right objective matters more than planning multiple steps ahead -- and the paper notes that extending to $\gamma > 0$ with long-term rewards is future work.
The update rule is mini-batch averaged:

$$\theta \leftarrow \theta + \alpha \cdot \frac{1}{|B|} \sum_{i \in B} R_i \, \nabla_\theta \log \pi_\theta(a_i \mid s_i)$$

where $B$ is a mini-batch of single-step $(s_i, a_i, R_i)$ samples and $\alpha$ is the learning rate.
This is vanilla REINFORCE (see Part 2 for the derivation) with two modifications: (1) each sample is a single-step interaction (since $\gamma = 0$, there are no multi-step trajectories), and (2) gradients are averaged over a mini-batch of ad requests rather than computed from one sample. The averaging reduces gradient variance -- important because individual ad impressions are noisy signals.
You might notice there's no importance sampling correction, even though the data is off-policy (collected by uniform random exploration). With $\gamma = 0$ and a single step, this is effectively a reward-weighted classification problem. And because the behavior policy is uniform, every action is equally likely to appear in the training data -- no action is systematically over- or under-represented. Combined with the single-step setting (no trajectory-level compounding of probabilities), this eliminates the distribution mismatch that importance sampling normally corrects for.
The result: the model learns to assign higher probability to actions that received higher reward. This differs from the trajectory-level off-policy correction in Part 2, where importance weights correct for the compounding effect of different action probabilities across multiple steps.
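Here's a minimal sketch of what that single-step, mini-batch update looks like as a PyTorch loss, assuming a policy network that outputs logits over the 1,000 discretized actions. The names (policy_net, states, actions, rewards) are placeholders, not the paper's code.

# Sketch of a single-step, mini-batch REINFORCE loss (gamma = 0): reward-weighted log-likelihood.
import torch.nn.functional as F

def single_step_reinforce_loss(policy_net, states, actions, rewards):
    """states: [batch, state_dim], actions: [batch] long, rewards: [batch] float tensors."""
    logits = policy_net(states)                           # [batch, num_actions]
    log_probs = F.log_softmax(logits, dim=-1)             # log pi_theta(a | s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # With gamma = 0 there is no return-to-go: each sample is weighted by its immediate reward,
    # and averaging over the mini-batch reduces gradient variance.
    return -(rewards * chosen).mean()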
Action Space: Three Iterations to Get Right
With the algorithm chosen, the next challenge was designing a tractable action space. The design went through three iterations, and the story is worth telling because it illustrates a general lesson: start simple, discover what breaks, and constrain further.
Iteration 1 -- Continuous ($a \in \mathbb{R}^d$): Let the agent output continuous weight values. This failed because they couldn't get enough training data without a simulator. The model wouldn't converge when the number of training examples was "much smaller than the continuous action space."
Iteration 2 -- Discretized ($a \in \{1, \dots, N\}^d$): Partition each weight into $N$ equally spaced values within a predefined range. With $d$ hyperparameters (easily double digits), this gives $N^d$ possible actions -- still too large for the configurations they tested (though the failures may also stem from Actor-Critic rather than action space size alone).
Iteration 3 -- Grouped discrete ($a \in \{1, \dots, N\}^3$): Group semantically related weights together so they always share the same discretization index. For example, click and click30 are closely related (both measure click engagement), so their weights move together. But conversion is a different behavior, so it gets its own group. The three groups are:
- Click-related weights ($w_{\text{click}}$, $w_{\text{click30}}$)
- Conversion-related weights ($w_{\text{conversion}}$)
- Reserve price threshold ($r$)
With 3 groups and $N = 10$ values each: $10^3 = 1{,}000$ total actions. This is small enough for a softmax output layer and large enough to express meaningful variation.
Note
The key insight is that semantic grouping reduces dimensionality without losing expressiveness. Weights that should move together (click and long-click) are forced to, which acts as a strong inductive bias. This is similar to weight tying in neural networks -- constraining the model makes it easier to learn.
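To see how small the grouped action space is in practice, here's an illustrative sketch that decodes a flat action index in [0, 1000) into the three tied weight groups. The value ranges are made-up placeholders, not the paper's hyperparameters.

# Illustrative decoding of a grouped discrete action into ranking weights (ranges are assumptions).
import numpy as np

N_VALUES = 10  # discretization levels per group

click_grid = np.linspace(0.5, 2.0, N_VALUES)        # shared by w_click and w_click30
conversion_grid = np.linspace(0.5, 2.0, N_VALUES)   # w_conversion
reserve_grid = np.linspace(0.01, 0.10, N_VALUES)    # reserve price r

def decode_action(action_index):
    """Map a flat index in [0, 1000) to (click group, conversion group, reserve group) values."""
    i_click, rem = divmod(action_index, N_VALUES * N_VALUES)
    i_conv, i_reserve = divmod(rem, N_VALUES)
    return {
        "w_click": click_grid[i_click],
        "w_click30": click_grid[i_click],   # tied: click-related weights move together
        "w_conversion": conversion_grid[i_conv],
        "reserve_price": reserve_grid[i_reserve],
    }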
State Representation
The action space defines what the agent can do. The state defines what it sees. The state captures three categories of features when an ad request arrives:
- User profile: age, gender, metro area
- User activities: historical action counts, both on-site (clicked items) and off-site (purchased items)
- Contextual: hour of day, day of week, IP country
Notably, the paper excludes real-time supply/demand information (total ad budget in inventory, current user traffic) even though it would help. The reason is practical: "such information is either hard to measure or infeasible to be obtained during serving time." This is a common production constraint -- features that would be useful in theory but impossible to compute at the latency required for real-time ad serving.
Network Architecture
Given this state representation, the policy model is straightforward: an MLP that maps states to a probability distribution over 1,000 actions:
- Each categorical feature passes through its own embedding layer
- Embeddings are concatenated with numerical and dense vector features
- The combined representation passes through hidden layers: batch normalization -> linear -> ReLU
- A final softmax layer outputs $\pi_\theta(a \mid s)$ over all 1,000 discretized actions
- At serving time: $a^* = \arg\max_a \pi_\theta(a \mid s)$
One forward pass per ad request. No value function to evaluate, no search over actions. This is what makes REINFORCE practical here -- the policy network directly outputs action probabilities, and you just take the argmax.
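Here's a rough PyTorch sketch of a policy network in this shape. The layer sizes, embedding dimensions, and feature handling are assumptions for illustration -- the paper describes the structure (embeddings, concatenation, batch norm / linear / ReLU blocks, softmax over 1,000 actions) but not these specifics.

# Sketch of the policy MLP described above; sizes and feature names are assumptions.
import torch
import torch.nn as nn

class UtilityTuningPolicy(nn.Module):
    def __init__(self, cat_cardinalities, num_numeric, num_actions=1000, hidden=256, emb_dim=16):
        super().__init__()
        # One embedding table per categorical feature (user profile, context, ...)
        self.embeddings = nn.ModuleList([nn.Embedding(c, emb_dim) for c in cat_cardinalities])
        in_dim = emb_dim * len(cat_cardinalities) + num_numeric
        self.mlp = nn.Sequential(
            nn.BatchNorm1d(in_dim), nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.BatchNorm1d(hidden), nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, cat_features, numeric_features):
        # cat_features: [batch, num_cat] integer ids; numeric_features: [batch, num_numeric]
        embs = [emb(cat_features[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embs + [numeric_features], dim=1)
        return torch.softmax(self.mlp(x), dim=-1)      # pi_theta(a | s) over 1,000 actions

    @torch.no_grad()
    def act(self, cat_features, numeric_features):
        return self.forward(cat_features, numeric_features).argmax(dim=-1)  # serving: one forward pass + argmax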
Reward Design: Campaign-Type-Specific
The reward design required more iteration than the architecture. Don't confuse the ranking utility (which scores each ad to determine display order) with the reward (which tells the RL agent how good its chosen weights were). The utility is what the agent outputs -- the reward is the feedback it receives. They share similar components but serve different roles.
A quick primer on ad campaign types, since the reward depends on them: when advertisers create a campaign, they choose what they're willing to pay for. A click-through campaign pays per click (CPC), a conversion campaign pays per purchase or sign-up (CPA), and an impression campaign pays per ad view (CPM). This choice shapes the entire optimization objective -- the platform needs to maximize value differently for each type.
The reward has two components, estimated revenue and estimated user value (note: these use predicted probabilities from existing ranking models, not observed user actions -- the reward is computed at serving time before the ad is shown).
Estimated Revenue varies by campaign type -- it is the predicted probability of the campaign's paid action times the bid:

$$R_{\text{revenue}} = p_a \cdot b$$

where $p_a$ is the predicted click probability for click-through campaigns, the predicted conversion probability for conversion campaigns, and the predicted impression probability for impression campaigns.
Estimated User Value is a weighted combination of engagement probabilities:

$$R_{\text{user}} = \sum_{e \in \{\text{click},\,\text{click30},\,\text{conversion}\}} \gamma_e \cdot p_e$$

where the coefficients $\gamma_e$ differ by campaign type (note: this $\gamma$ is a reward coefficient, not the discount factor from the REINFORCE update above):
| Campaign Type | $\gamma_{\text{click}}$ (click) | $\gamma_{\text{click30}}$ (click30) | $\gamma_{\text{conversion}}$ (conversion) |
|---|---|---|---|
| Click-through | 1.0 | 0.5 | 0.0 |
| Conversion | 0.1 | 0.4 | 0.5 |
| Impression | 0.0 | 0.0 | 0.0 |
The logic: click-through campaigns care about clicks (obviously), conversion campaigns weight conversions heavily but still want some click signal, and impression campaigns only optimize revenue (the advertiser pays per impression regardless of user behavior).
Warning
Getting the reward wrong is the easiest way to break this system. The paper's ablation study shows that using a revenue-only reward (all $\gamma_e = 0$) caused CTR to drop from +9.71% to -0.74%. The agent learned to maximize revenue at the expense of engagement. Campaign-specific reward tuning is not optional.
One more detail that matters in practice: min-max batch normalization on the engagement probabilities ($p_{\text{click}}$, $p_{\text{click30}}$, $p_{\text{conversion}}$) "significantly improves the stability and convergence speed of training." For each mini-batch of samples, each probability is normalized to $[0, 1]$ by subtracting the batch minimum and dividing by the range. Without this, the different probability scales cause training instability.
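As a concrete illustration, here's a sketch of the reward computation using the campaign-specific coefficients from the table and the min-max normalization just described. Only the coefficient values come from the paper; the function structure and names are assumptions.

# Sketch of the campaign-specific reward; coefficient values from the table, structure assumed.
import numpy as np

GAMMA = {  # (gamma_click, gamma_click30, gamma_conversion) per campaign type
    "click_through": (1.0, 0.5, 0.0),
    "conversion":    (0.1, 0.4, 0.5),
    "impression":    (0.0, 0.0, 0.0),
}

def minmax_normalize(x):
    """Min-max normalize a mini-batch of probabilities to [0, 1] for training stability."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def batch_rewards(campaign_types, p_paid, bids, p_click, p_click30, p_conv):
    """Estimated revenue plus campaign-weighted estimated user value for a mini-batch of ad requests."""
    p_click, p_click30, p_conv = map(minmax_normalize, (p_click, p_click30, p_conv))
    rewards = []
    for ct, pa, b, pc, pc30, pcv in zip(campaign_types, p_paid, bids, p_click, p_click30, p_conv):
        g_click, g_click30, g_conv = GAMMA[ct]
        est_revenue = pa * b
        est_user_value = g_click * pc + g_click30 * pc30 + g_conv * pcv
        rewards.append(est_revenue + est_user_value)
    return np.array(rewards)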
Exploration and Data Collection
With the reward function defined, the remaining question is where the training data comes from. DRL-PUT is trained off-policy. Only 0.5% of production traffic is reserved for exploration -- the rest uses the production ranking weights. On that exploration slice, actions are sampled from a uniform distribution over all 1,000 discretized actions.
They tried Gaussian exploration first (centered on the current production weights, so most samples stay close to the status quo). It failed due to "insufficient exploration" -- too many sampled actions clustered near the production values, leaving the agent with almost no data about actions far from the current policy. Uniform random solved this: every action gets roughly equal coverage across the exploration traffic.
The training loop is continuous: (state, action, reward) triplets stream in from online serving logs, and the model trains on them in real time. This is off-policy because the behavior policy (uniform random on the exploration slice) differs from the learned policy.
Safety Mechanisms
Uniform random exploration over ad weights sounds risky -- bad weight combinations could surface irrelevant ads. Deploying RL in an ad system requires multiple safety layers:
- Bounded action space: All weights and the reserve price are constrained to predefined ranges. The agent can't push any of them outside those bounds.
- Reserve price filtering: The indicator $\mathbb{1}[p_a \cdot b \geq r]$ hard-filters low-revenue ads before ranking, preventing the agent from surfacing ads that generate negligible revenue.
- Minimal exploration traffic: 0.5% of requests. If the exploration policy performs poorly, 99.5% of users are unaffected.
- Offline evaluation gates: Before a trained model goes to production, it must pass two offline metrics: Diversity (action diversity -- does the model produce varied actions for different states, or has it collapsed to always choosing the same action?) and Relative_Gain (does the candidate policy outperform the behavior policy in reward-weighted expectation?).
- Decoupled from ranking models: The RL agent only adjusts weights on top of existing model outputs. It never modifies the click or conversion prediction models themselves. If the RL agent fails, you revert to static weights -- nothing else is affected.
Personalization Evidence
One of the more interesting results from the paper: the model genuinely personalizes. When users are bucketed by historical CTR, the model increases the click-related weights for high-CTR users while decreasing both the conversion weight and the reserve price $r$. It learned to show these users more engaging content and apply a lower revenue bar. For high-CVR users, the reverse: a higher reserve price (these users are more valuable to conversion-focused advertisers) and a higher conversion weight.
A smart product manager would make the same trade-off -- but the model does it per-request rather than per-quarter.
Production Results
With these design choices in place, here's what Pinterest measured. They deployed DRL-PUT in an online A/B experiment. The "treated segment" is the user population whose ads were ranked using the DRL-PUT model; "platform-wide" includes both treated and control users:
| Metric | Platform-wide | Treated Segment |
|---|---|---|
| Revenue | +0.27% | -0.16%* |
| Impressions | +0.02%* | -0.08%* |
| CTR | +1.62% | +9.71% |
| CTR30 (long-click) | +1.03% | +7.73% |
| CVR | +0.67% | +1.26% |
* = statistically insignificant
The headline numbers: +9.7% CTR and +7.7% long-click rate on the treated segment, with a statistically insignificant -0.16% revenue change (indistinguishable from zero). The paper notes that "0.5% increase in CTR is considered as a substantial gain" at Pinterest's scale, making the ~10% improvement especially notable.
The revenue result is a predictable consequence of the reward design: the agent learned to favor engaging ads over high-bid-but-boring ones, which is exactly what a reward combining revenue and engagement would produce. Platform-wide (aggregated across all traffic), all metrics improved.
RecoMind: Simulator-Based RL
Paper: RecoMind: Recommender System Simulation for RL-Based Optimization
Here's the problem with RL in production: you can't experiment freely with real users. A bad policy means degraded user experience, and you don't know it's bad until users suffer.
RecoMind's solution: build a simulator from your existing supervised models, then train RL safely in simulation.
Simulator Architecture
The simulator predicts how users would respond to any recommendation. It's trained on historical data to mimic real user behavior.
Bootstrapping from Production
RecoMind doesn't start from scratch. The initial RL policy is the production ranking model. Training then improves on this baseline.
Quick primer on Q-learning (if you're coming from REINFORCE in Part 2): Unlike REINFORCE, which learns a policy directly, Q-learning learns a value function $Q(s, a)$ that predicts total future reward. We then act greedily: pick the action with the highest $Q(s, a)$. The "target network" is a slowly-updated copy of $Q$ used to stabilize training -- without it, we'd be chasing a moving target.
# Simplified RecoMind training loop
import copy
import torch
import torch.nn.functional as F

def soft_update(target, source, tau):
    """Polyak-average the online network's parameters into the target network."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.copy_(tau * s_param.data + (1 - tau) * t_param.data)

def train_recomind(simulator, production_policy, num_iterations, gamma=0.75, lr=1e-4):
    """
    Train RL policy using simulator.
    Args:
        simulator: User behavior simulator
        production_policy: Current production model to bootstrap from
        num_iterations: Number of training iterations
        gamma: Discount factor (0.75 = ~4 step effective horizon, appropriate for
               in-session optimization where sessions are short)
        lr: Learning rate for the Q-network optimizer
    """
    # Initialize RL policy from production
    rl_policy = copy.deepcopy(production_policy)
    target_network = copy.deepcopy(rl_policy)
    optimizer = torch.optim.Adam(rl_policy.parameters(), lr=lr)
    for iteration in range(num_iterations):
        # Sample trajectories in simulator
        states, actions, rewards, next_states = simulator.rollout(rl_policy)
        # TD target for Q-learning
        with torch.no_grad():
            next_q = target_network(next_states).max(dim=1).values
            # Note: In production, multiply by (1 - done) to zero out bootstrapping at episode ends
            td_target = rewards + gamma * next_q  # Discounted future value
        current_q = rl_policy(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(current_q, td_target)
        # Update policy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Soft update target network
        soft_update(target_network, rl_policy, tau=0.001)
    return rl_policy
Exploration Strategy
RecoMind uses a hybrid exploration strategy:
- ε-greedy: With probability ε, take a random action
- Softmax over top-K: Instead of purely random, sample from a softmax over the top-K actions
This balances exploration (finding new good recommendations) with staying close to known-good options (not deviating too far from production).
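A minimal sketch of this hybrid scheme, assuming the policy exposes Q-values (or scores) for the candidate items; epsilon, top_k, and temperature here are illustrative defaults, not values from the paper.

# Sketch of hybrid exploration: epsilon-greedy, with random choices drawn from a softmax over the top-K.
import numpy as np

def select_action(q_values, epsilon=0.1, top_k=50, temperature=1.0):
    """With probability epsilon, sample from a softmax over the top-K actions; otherwise act greedily."""
    q_values = np.asarray(q_values, dtype=float)
    if np.random.random() < epsilon:
        top_idx = np.argsort(q_values)[-top_k:]          # restrict exploration to known-good options
        logits = q_values[top_idx] / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(np.random.choice(top_idx, p=probs))
    return int(np.argmax(q_values))                      # exploit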
Simulator Validation
A simulator is only useful if it's accurate. RecoMind validates through:
- Offline simulation testing: Run 100,000 episodes with the trained policy and measure predicted metrics
- Online A/B testing: Deploy to a held-out population (40M users in their case) and verify simulation predictions match online performance
- Metric correlation: Ensure offline metric improvements correlate with online improvements
Results
Pinterest deployed RecoMind-trained policies:
- +15.81% long watch time
- +4.71% session depth (users view more content)
Simulation enabled testing without risking user experience.
Trade-offs Comparison
| Aspect | DRL-PUT | RecoMind |
|---|---|---|
| What it learns | Ranking weights | Full ranking policy |
| Action space | Discretized weights | Item selection |
| Training data | Online | Simulated |
| Safety | Bounded actions | Simulation isolation |
| Complexity | Moderate | High (requires simulator) |
Pinterest chose two complementary approaches for different problems. But your system isn't Pinterest. Let's build a framework for choosing your own path.
Decision Framework
After covering seven papers, how do you choose the right approach for your system?
Which RL Approach Should You Use?
The right choice depends on your primary optimization objective and your constraints. The summary table below maps common situations to the approaches covered in this series.
Summary Table
| Objective | Recommended Approach | Paper |
|---|---|---|
| Immediate engagement, large catalog | REINFORCE + Top-K | YouTube 2019 |
| Need stable training | Actor-Critic | Google 2022 |
| Offline policy evaluation | Control Variates | Netflix 2021 |
| Long-term retention | RLUR dual-critic | Kuaishou 2023 |
| Cold-start / new content | Impatient Bandits | Spotify 2023 |
| Multi-objective balance | DRL-PUT | Pinterest 2025 |
| Fast safe iteration | Simulator RL | Pinterest 2025 |
What This Series Taught Us
Looking back across all four parts, the core lesson is this: successful RL for recommendations is about managing the gap between textbook RL and production reality.
Textbook RL assumes: online learning, immediate rewards, single-step actions, stable environments. Production has: logged data, delayed signals, slates of items, evolving user preferences. Every paper we covered addresses one of these gaps:
- YouTube REINFORCE: Handles massive action spaces and off-policy data
- Google Actor-Critic: Reduces variance for stable training
- Netflix Slate OPE: Evaluates slate policies offline
- Kuaishou RLUR: Bridges immediate and retention rewards
- Spotify Impatient Bandits: Uses partial signals for faster learning
- Pinterest DRL-PUT: Dynamically balances multiple objectives
- Pinterest RecoMind: Enables safe offline iteration via simulation
The successful papers we covered share a pattern: they explicitly identify which RL assumptions their problem violates, then design around those gaps.
Starting Recommendation
If you're building RL for recommendations today, here's my recommended approach:
1. Build a simulator first (RecoMind approach)
   - Use your existing supervised models as the foundation
   - Validate simulator accuracy carefully
   - This enables fast, safe experimentation
2. Bootstrap from production
   - Start with your current ranking model as the initial policy
   - RL then learns to improve on this baseline
3. Use DRL-PUT-style multi-objective tuning
   - If you have competing objectives (engagement, revenue, fairness)
   - Learn to balance them dynamically rather than with static weights
4. Add RLUR-style retention optimization
   - Once engagement is solid, add a retention objective
   - Dual critics provide learning signal at both timescales
   - Note: this requires substantial infrastructure (the retention signal takes days to materialize and needs user-group normalization)
5. Deploy Impatient Bandits for cold-start
   - New content needs exploration
   - Bayesian updating accelerates quality discovery
Implementation Roadmap
Phase 1: Contextual Bandits Foundation
Before jumping to full RL, validate that learning-based decisions beat heuristics. Start with contextual bandits—they're simpler and faster to debug.
Why start here? If your recommendation problem doesn't benefit from personalized exploration (bandits), it won't benefit from sequential optimization (full RL). Bandits are your sanity check. They answer: "Does intelligent exploration find better items than our current heuristics?" If no, stop here—something is wrong with your reward signal or action space.
What to expect: A well-tuned LinUCB should beat random exploration within a few thousand interactions and approach your best static policy within 10-50K interactions (depending on feature quality). If you're not seeing this, debug before proceeding.
LinUCB intuition: LinUCB predicts rewards as a linear function of features and tracks uncertainty about those predictions. Think of it this way: the matrix $A$ accumulates information about which features you've seen. Early on, $A$ is small (little information), so $A^{-1}$ is large (high uncertainty). As you observe more data, $A$ grows and $A^{-1}$ shrinks -- you become more confident.
The estimate $\hat{\theta} = A^{-1} b$ gives your best guess of reward weights. The term $\sqrt{x^\top A^{-1} x}$ measures how uncertain you are for this specific context $x$. LinUCB adds the uncertainty to the estimate -- $\text{score} = x^\top \hat{\theta} + \alpha \sqrt{x^\top A^{-1} x}$ -- so items you're unsure about get a bonus. This encourages exploration: you try uncertain items to learn more about them.
import numpy as np

class LinUCB:
"""
Linear Upper Confidence Bound bandit for personalized ranking.
"""
def __init__(self, d, alpha=1.0):
self.d = d # Feature dimension
self.alpha = alpha # Exploration parameter
self.A = {} # Per-arm covariance matrices
self.b = {} # Per-arm reward vectors
def get_arm_params(self, arm):
if arm not in self.A:
self.A[arm] = np.eye(self.d)
self.b[arm] = np.zeros(self.d)
return self.A[arm], self.b[arm]
def select_arm(self, context, arms):
"""
Select arm with highest UCB score.
Args:
context: [d] feature vector
arms: list of available arms
Returns:
Selected arm
"""
best_arm = None
best_score = float('-inf')
for arm in arms:
A, b = self.get_arm_params(arm)
A_inv = np.linalg.inv(A)
# Estimated reward
theta = A_inv @ b
mean = context @ theta
# Confidence bound
std = np.sqrt(context @ A_inv @ context)
# UCB score
score = mean + self.alpha * std
if score > best_score:
best_score = score
best_arm = arm
return best_arm
# Note: The matrix inverse inside select_arm is O(d^3) per arm.
# For production, use Sherman-Morrison incremental updates or solve
# the linear system directly. This code is for pedagogy.
def update(self, arm, context, reward):
"""Update arm parameters with observed reward.
Key insight: A accumulates the "information matrix" (sum of x*x^T).
When we see context x and reward r:
- A grows by x*x^T (we've seen this direction in feature space)
- b grows by r*x (we've associated this direction with reward r)
Then theta = A^{-1}b is the least-squares estimate of reward weights.
The more observations in a direction, the larger A is in that direction,
so A^{-1} is smaller -> lower uncertainty -> smaller exploration bonus.
"""
A, b = self.get_arm_params(arm)
self.A[arm] = A + np.outer(context, context) # Information accumulation
self.b[arm] = b + reward * context # Reward-weighted direction
Metrics to track:
- Regret compared to best-arm-in-hindsight (a rough tracking sketch follows this list)
- Exploration rate (how often do you recommend uncertain items?)
- Business metrics (engagement, revenue)
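Regret against the best arm in hindsight is easy to approximate from your logs. Here's a rough diagnostic sketch, not a formal regret estimator -- it uses empirical means as a stand-in for true arm values.

# Rough cumulative-regret tracker against the best arm in hindsight (diagnostic only).
import numpy as np
from collections import defaultdict

class RegretTracker:
    def __init__(self):
        self.rewards_by_arm = defaultdict(list)
        self.chosen_rewards = []

    def record(self, arm, reward):
        self.rewards_by_arm[arm].append(reward)
        self.chosen_rewards.append(reward)

    def cumulative_regret(self):
        # Best arm in hindsight = arm with the highest empirical mean reward so far
        best_mean = max(np.mean(r) for r in self.rewards_by_arm.values())
        return best_mean * len(self.chosen_rewards) - sum(self.chosen_rewards)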
Phase 2: Add REINFORCE
Once bandits show value—meaning they beat your heuristic baseline on a held-out A/B test—graduate to policy gradient methods.
When to make this jump: You should move to REINFORCE when you have evidence that sequence matters. Signs include: (1) bandit performance varies based on what was recently shown, (2) session-level metrics don't match single-decision metrics, (3) users who see diverse content early in sessions engage more later. If decisions are truly independent, stick with bandits—they're simpler and often better.
Prerequisites check: Before implementing REINFORCE, ensure you have: (1) behavior policy probabilities logged for every recommendation, (2) a clear reward signal that captures session-level or long-term value, (3) infrastructure to train on sequences of (state, action, reward) tuples rather than individual examples.
import torch
import torch.nn as nn
import torch.nn.functional as F
class REINFORCERecommender(nn.Module):
"""
REINFORCE recommender with importance sampling and Top-K correction.
See Part 2 for detailed explanation of these components.
"""
def __init__(self, state_dim, item_dim, hidden_dim=256):
super().__init__()
self.user_encoder = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim)
)
self.item_encoder = nn.Sequential(
nn.Linear(item_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim)
)
def compute_policy(self, user_state, items):
"""
Compute softmax policy over items.
Args:
user_state: [batch, state_dim]
items: [num_items, item_dim]
Returns:
policy: [batch, num_items] probabilities
"""
user_emb = self.user_encoder(user_state) # [batch, hidden]
item_emb = self.item_encoder(items) # [num_items, hidden]
# Dot product scores
scores = user_emb @ item_emb.T # [batch, num_items]
# Softmax policy
policy = F.softmax(scores, dim=-1)
return policy
def reinforce_loss(self, user_states, items, actions, rewards,
behavior_probs, k=16, cap=10.0):
"""
REINFORCE loss with importance sampling and Top-K correction.
"""
# Current policy probabilities
policy = self.compute_policy(user_states, items)
pi = policy.gather(1, actions.unsqueeze(1)).squeeze()
# Importance weights (capped)
weights = (pi / behavior_probs).clamp(max=cap)
# Top-K correction
lambda_k = k * (1 - pi) ** (k - 1)
# Log probability
log_pi = torch.log(pi + 1e-10)
# REINFORCE gradient
loss = -(weights * lambda_k * rewards * log_pi).mean()
return loss
Key additions over bandits:
- Policy is a neural network (richer representations)
- Off-policy correction enables learning from logged data
- Top-K correction for slate recommendations
Phase 3: Build Simulator
This is where iteration speed increases dramatically. Instead of waiting weeks for A/B test results, you can test policy changes in minutes.
import numpy as np

class UserSimulator:
def __init__(self, click_model, watch_model, session_model):
self.click_model = click_model # P(click | user, item)
self.watch_model = watch_model # E[watch_time | user, item, click]
self.session_model = session_model # P(continue | user, history)
def simulate_session(self, user_state, policy, max_steps=20):
"""
Simulate a user session under a given policy.
Returns:
trajectory: list of (state, action, reward, next_state)
"""
trajectory = []
state = user_state
for step in range(max_steps):
# Policy selects item
items = self.get_candidate_items(state)
action = policy.select(state, items)
# Simulate user response
click_prob = self.click_model(state, action)
clicked = np.random.random() < click_prob
if clicked:
watch_time = self.watch_model(state, action)
else:
watch_time = 0
reward = watch_time # Or other reward function
# State transition
next_state = self.transition(state, action, clicked, watch_time)
trajectory.append((state, action, reward, next_state))
# Check if session continues
continue_prob = self.session_model(next_state)
if np.random.random() > continue_prob:
break
state = next_state
return trajectory
Validation is critical:
- Compare simulated vs. actual session statistics
- Check that policy rankings are preserved (a rank-correlation sketch follows)
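For the "policy rankings are preserved" check, one simple approach is rank correlation between simulator scores and online scores for policies you've actually shipped. A sketch, assuming you have both numbers per policy (the inputs here are hypothetical dictionaries keyed by policy):

# Sketch: do policies ranked best-to-worst in simulation stay in roughly the same order online?
from scipy.stats import spearmanr

def ranking_preserved(policies, simulator_metric, online_metric, threshold=0.8):
    """Return (passed, rho): Spearman rank correlation between simulated and online scores."""
    sim_scores = [simulator_metric[p] for p in policies]
    online_scores = [online_metric[p] for p in policies]
    rho, _ = spearmanr(sim_scores, online_scores)
    return rho >= threshold, rho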
Phase 4: Add Long-Term Objectives
Following RLUR's approach:
import torch.nn.functional as F

# PolicyNetwork and ValueNetwork are assumed to be defined elsewhere (standard MLPs)
class DualCriticAgent:
def __init__(self, state_dim, action_dim):
self.actor = PolicyNetwork(state_dim, action_dim)
self.critic_immediate = ValueNetwork(state_dim)
self.critic_retention = ValueNetwork(state_dim)
def compute_loss(self, batch, alpha=0.3):
"""
Dual-critic loss combining immediate and retention signals.
Args:
batch: (states, actions, imm_rewards, ret_rewards, next_states)
alpha: weight on retention objective (0.3 is typical; can increase over training)
"""
states, actions, imm_rewards, ret_rewards, next_states = batch
# Immediate critic
v_imm = self.critic_immediate(states)
v_imm_next = self.critic_immediate(next_states).detach()
td_imm = imm_rewards + 0.99 * v_imm_next
loss_imm = F.mse_loss(v_imm, td_imm)
# Retention critic (longer horizon, sparser updates)
v_ret = self.critic_retention(states)
v_ret_next = self.critic_retention(next_states).detach()
td_ret = ret_rewards + 0.999 * v_ret_next # Higher gamma for long-term
loss_ret = F.mse_loss(v_ret, td_ret)
# Combined advantage for actor
adv_imm = (td_imm - v_imm).detach()
adv_ret = (td_ret - v_ret).detach()
advantage = (1 - alpha) * adv_imm + alpha * adv_ret
# Actor loss
log_probs = self.actor.log_prob(states, actions)
loss_actor = -(advantage * log_probs).mean()
return loss_actor + loss_imm + loss_ret
When RL Is the Wrong Tool
Before discussing mistakes, a reality check: RL doesn't always help. Skip RL entirely when:
- Decisions are truly independent: If showing item A today doesn't affect what users want tomorrow, bandits are sufficient.
- Reward signals are clean and immediate: If you have reliable, instant feedback, simpler methods often win.
- Your action space is small and enumerable: With fewer than 100 items, exhaustive exploration is feasible.
- You lack behavior policy logging: Without the logged behavior policy probabilities $\beta(a \mid s)$, off-policy RL is impossible.
The papers in this series succeeded because their problems genuinely needed sequential optimization. Don't assume yours does—validate with simpler methods first.
Bottom line: If you're unsure, start with a bandit pilot. If bandits don't beat your baseline, RL won't either.
Common Mistakes (Learn from Others' Pain)
1. Starting with full RL before validating with bandits: If bandits don't beat heuristics, full RL won't either. Bandits are your sanity check. I've seen teams spend 6 months building a REINFORCE pipeline only to discover their reward signal was too noisy to learn anything. Bandits would have revealed this in 2 weeks.
2. Optimizing a proxy reward: Watch time is not retention. Clicks are not satisfaction. Make sure your reward signal actually tracks what you care about.
3. Ignoring exploration in evaluation: Your A/B test metrics will underestimate RL gains because the logging policy didn't explore. Use OPE to estimate what exploration would have found (see the IPS sketch after this list).
4. Deploying without guardrails: RL policies can find weird solutions. Set hard constraints on diversity, ad load, and content quality before deployment.
5. Not logging behavior policy probabilities: You can't do off-policy correction without the behavior policy probability $\beta(a \mid s)$. Log it at serving time. This is non-negotiable.
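Here's the minimal inverse propensity scoring (IPS) estimator behind the OPE suggestion in mistake 3 -- a sketch assuming you logged, for each recommendation, the behavior policy probability, the new policy's probability for the same action, and the observed reward.

# Minimal capped IPS estimate of a new policy's average reward from logged data.
import numpy as np

def ips_estimate(target_probs, behavior_probs, rewards, cap=10.0):
    """target_probs: pi(a|s) under the candidate policy; behavior_probs: logged beta(a|s)."""
    weights = np.clip(np.asarray(target_probs) / np.asarray(behavior_probs), None, cap)
    return float(np.mean(weights * np.asarray(rewards)))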
Troubleshooting Guide
Problem: Training loss is unstable (oscillating or diverging)
- Check importance weights: if they're frequently hitting the cap, lower the cap or investigate why the target policy $\pi$ and behavior policy $\beta$ differ so much
- Check for NaN in log-probabilities: ensure you're using log_softmax, not log(softmax(...))
- Reduce the learning rate by 10x and see if it stabilizes
Problem: Policy collapses to always recommending the same items
- Add entropy regularization to the actor loss: -entropy_weight * policy.entropy().mean()
- Check if your reward signal has enough variance -- constant rewards give no learning signal
- Verify exploration is happening: log the entropy of your policy distribution
Problem: Offline metrics look good but online A/B test shows no improvement
- Your simulator or OPE may have distribution shift—retrain on more recent data
- Check that you're logging the behavior policy probability $\beta(a \mid s)$ correctly (a common bug is logging a raw pre-softmax score instead of the post-softmax probability actually used at serving time)
- Verify the action space matches: if production has more/different items, your policy may be misaligned
Problem: Actor-Critic critic loss doesn't decrease
- Target network might be updating too fast -- decrease $\tau$ to 0.001
- Reward scale might be wrong—normalize rewards by their standard deviation
- Check for reward delay: if rewards arrive after state transitions, your TD targets are wrong
Conclusion
We started with a frustrating problem: your CTR model gets better, but nothing that matters improves. Now you have a toolkit to fix it.
Supervised learning predicts; RL decides. When you care about sequences, long-term outcomes, or exploration, prediction alone isn't enough.
But "use RL" is too vague. Now you know: start with a simulator (so you can iterate safely), bootstrap from your production model (so you're improving on a known baseline), and graduate from bandits to REINFORCE to actor-critic as your problem demands.
YouTube reported +0.85% watch time, Pinterest +15.81% long watch time, and Kuaishou statistically significant DAU improvements (which they note translates to ~600K additional daily active users at their scale). These are paper-reported results from production deployments; your mileage will vary.
Remember that flat retention despite better CTR from Part 1? Now you know how to fix it: prediction got you here, but decisions will get you further.
Here's your Monday morning action: Look at your logging infrastructure. Are you recording the probability that your current system assigned to each recommendation? If not, you can't do off-policy learning, and everything in this series is theoretical. Fix that first. Then build a user simulator using your existing click and watch-time prediction models. Run 1,000 simulated sessions. That's your sandbox for RL experiments.
Your move.
Further Reading
Foundational Papers:
- Policy Gradient Methods for RL with Function Approximation (Sutton et al., 1999)
- Off-Policy Actor-Critic (Degris et al., 2012)
Industry Papers Covered:
- YouTube REINFORCE (2019)
- Google Actor-Critic (2022)
- Netflix Slate OPE (2021)
- Kuaishou RLUR (2023)
- Spotify Impatient Bandits (2023)
- Pinterest DRL-PUT (2025)
- Pinterest RecoMind (2025)
Bonus: Building Your Own Simulator
If you're serious about RL for recommendations, building a simulator pays off quickly. Pinterest's RecoMind shows the payoff: simulation let them test policy changes in minutes instead of waiting weeks for A/B results.
Why Simulators?
- Fast iteration: Run thousands of experiments in hours, not months
- Safe exploration: Test risky policies without affecting real users
- Counterfactual analysis: Answer "what if" questions about past decisions
- Debug and interpret: Understand why policies behave certain ways
Key Components
1. User Response Models
Train supervised models to predict user behavior:
# Click model: P(click | user, item, context)
click_model = train_binary_classifier(
features=['user_history', 'item_features', 'context'],
target='clicked'
)
# Watch time model: E[watch_time | user, item, clicked=True]
watch_model = train_regressor(
features=['user_history', 'item_features', 'context'],
target='watch_time',
filter_condition='clicked == True'
)
2. State Transition Model
How does user state evolve after each interaction?
def transition(user_state, action, response):
"""
Update user state based on interaction.
"""
new_state = user_state.copy()
# Add interaction to history
new_state['history'].append({
'item': action,
'clicked': response['clicked'],
'watch_time': response['watch_time']
})
# Update interest vectors (simplified)
if response['clicked']:
item_embedding = get_embedding(action)
new_state['interest'] = (
0.9 * new_state['interest'] +
0.1 * item_embedding
)
return new_state
3. Validation Suite
The simulator is only useful if it's accurate:
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def validate_simulator(simulator, policy, holdout_data):
    """
    Validate simulator against held-out real data.
    Each held-out session is a list of (state, action, response) tuples.
    """
    metrics = {}
    # 1. Response prediction accuracy
    predicted_clicks = []
    actual_clicks = []
    for session in holdout_data:
        for state, action, response in session:
            predicted_clicks.append(simulator.click_model(state, action))
            actual_clicks.append(response['clicked'])
    metrics['click_auc'] = roc_auc_score(actual_clicks, predicted_clicks)
    # Simulate one session per held-out user, starting from each session's initial state
    holdout_states = [session[0][0] for session in holdout_data]
    simulated_sessions = [simulator.simulate_session(s, policy) for s in holdout_states]
    # 2. Session length distribution
    # ks_2samp: Kolmogorov-Smirnov test comparing two distributions
    # Returns a statistic between 0-1; lower = more similar distributions
    sim_lengths = [len(s) for s in simulated_sessions]
    real_lengths = [len(s) for s in holdout_data]
    metrics['length_ks'] = ks_2samp(sim_lengths, real_lengths).statistic
    # 3. Engagement distribution
    # Simulated trajectories store (state, action, reward, next_state); real sessions store responses
    sim_engagement = [sum(step[2] for step in s) for s in simulated_sessions]
    real_engagement = [sum(step[2]['watch_time'] for step in s) for s in holdout_data]
    metrics['engagement_ks'] = ks_2samp(sim_engagement, real_engagement).statistic
    return metrics
Common Pitfalls
Distribution shift: The simulator learns from behavior policy data. When evaluating a very different policy, predictions may be unreliable. Solution: Use uncertainty estimates and stay close to the behavior distribution.
Missing confounders: Some factors affecting user behavior aren't in your features (mood, external distractions). Solution: Model aleatoric uncertainty (irreducible randomness in user behavior—some variance can't be explained no matter how good your model) and validate on diverse data.
Reward hacking: RL can find policies that exploit simulator bugs rather than genuinely improving recommendations. Solution: Regularly validate against A/B tests.
| Paper | Company | Year | Objective | Approach | Results |
|---|---|---|---|---|---|
| YouTube REINFORCE | YouTube | 2019 | Watch time | Policy Gradient + Top-K | +0.85% watch time |
| Actor-Critic Rec | Google | 2022 | User enjoyment | Actor-Critic + Value function | +0.07% (on top of REINFORCE) |
| Slate OPE | Netflix | 2021 | Offline evaluation | Control Variates | O(K) variance reduction |
| RLUR | Kuaishou | 2023 | User retention | Dual-Critic RL | +0.2% DAU |
| Impatient Bandits | Spotify | 2023 | Cold-start | Bayesian + Meta-learning | Near-optimal performance |
| DRL-PUT | Pinterest | 2025 | Multi-objective | Utility tuning RL | +9.7% CTR |
| RecoMind | Pinterest | 2025 | Session optimization | Simulator-based RL | +15.81% long watch |
Series Overview: