
RL for Recommender Systems, Part 4: Implementation Guide

You've read three articles of theory: policy gradients, off-policy correction, retention optimization, cold-start solutions. If you're like me, you're now asking: "Okay, but where do I start?"

This article answers that question.

By the end of this article, you'll be able to:

  • Understand Pinterest's two 2025 approaches: multi-objective tuning (DRL-PUT) and simulator-based RL (RecoMind)
  • Choose the right RL approach for your specific recommendation problem
  • Follow a practical implementation roadmap from bandits to full RL
  • Avoid common pitfalls that trip up RL practitioners

We'll cover Pinterest's two 2025 papers (which represent what a production RL system looks like today), then give you a decision framework and implementation roadmap you can adapt to your own system.


#Pinterest's RL Systems (2025)

Pinterest published two complementary papers in 2025, each tackling a different aspect of RL for recommendations.

#DRL-PUT: Multi-Objective Utility Tuning

Paper: Deep Reinforcement Learning for Ranking Utility Tuning in the Ad Recommender System at Pinterest

Every ad ranking system has a utility function -- a formula that combines predicted click rates, conversion rates, and bid prices into a single score. Whoever sets the weights in that formula controls the balance between revenue and user experience.

At most companies, a cross-functional team manually tunes those weights. Product says "engagement matters more this quarter," someone bumps w_click from 1.0 to 1.5, and everyone watches the dashboards. This works, but the Pinterest team identified three problems:

  1. The search space is combinatorial. With a dozen-plus weights, the number of possible combinations is enormous. No team can explore this manually.
  2. Static weights ignore context. A user who clicks on everything should probably see different ad weighting than a user who rarely clicks but occasionally converts. The same applies to time-of-day and seasonality (Black Friday vs. a normal Tuesday).
  3. The tuning objective is vague. "Make engagement better without hurting revenue too much" is not a loss function.

DRL-PUT replaces manual tuning with an RL agent that predicts the optimal weights per ad request. The agent doesn't modify the ranking models themselves -- it sits on top of the existing pipeline and adjusts the coefficients that combine model outputs into a final score.

#The Ranking Utility Formula

The ranking utility for a given ad takes this form:

U = \mathbf{1}(\text{Estimated\_Revenue} \geq b) \cdot \left( p(\text{optimized\_action}) \cdot \text{bid} + \sum_{i=1}^{n-1} p(\text{engagement\_action}_i) \cdot w_i \right)

where:

  • $p(\text{optimized\_action})$ is the predicted probability of the action the ad campaign optimizes for (click, conversion, or impression, depending on campaign type -- more on campaign types below)
  • $\text{bid}$ is the advertiser's bid price for that action
  • $p(\text{engagement\_action}_i)$ is the predicted probability of engagement type $i$: click, click30 (user spends more than 30 seconds on the ad's landing page, indicating genuine interest rather than an accidental tap), or conversion
  • $w_i$ are the weights DRL-PUT learns to tune per-user
  • $b$ is a reserve price threshold -- ads with estimated revenue below $b$ get filtered out entirely
  • $\mathbf{1}(\cdot)$ is an indicator function returning 1 if the condition is met, 0 otherwise

Inside the parentheses, the first component -- $p(\text{optimized\_action}) \cdot \text{bid}$ -- captures revenue. The summation $\sum_i p(\text{engagement\_action}_i) \cdot w_i$ captures user engagement value. DRL-PUT learns both the engagement weights $w_i$ and the reserve price $b$ dynamically based on who the user is and what context they're in.
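To make the formula concrete, here's a minimal sketch of how a single ad's utility could be computed. The dictionary keys and the `ranking_utility` helper are illustrative names, not from the paper:

Python
# Minimal sketch of the utility computation for one ad, assuming hypothetical
# field names for the model predictions and the weights chosen by DRL-PUT.
def ranking_utility(ad, weights, reserve_price_b):
    estimated_revenue = ad["p_optimized_action"] * ad["bid"]
    if estimated_revenue < reserve_price_b:
        return 0.0  # the indicator 1(Estimated_Revenue >= b) filters this ad out

    # Engagement component: sum of p(engagement_i) * w_i
    engagement_value = sum(ad[f"p_{name}"] * w for name, w in weights.items())
    return estimated_revenue + engagement_value

# Example: weights = {"click": 1.2, "click30": 0.8, "conversion": 0.5}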

[Figure: DRL-PUT architecture. User state features (profile: age, gender, metro; activities: clicks, purchases; context: hour, day, country) feed a REINFORCE policy network -- an MLP with a softmax over 1,000 grouped actions (click weights, conversion weight, reserve price) -- whose chosen action parameterizes the ranking utility formula; the resulting utility serves as the reward.]
DRL-PUT: RL agent tunes utility weights per-request, existing ranking models unchanged

#Why REINFORCE? (And Why Actor-Critic Failed)

If you've read Part 2, you know the two main families of policy gradient methods: REINFORCE and Actor-Critic. Pinterest tried both. Actor-Critic failed.

The reason is specific to ad recommender systems: estimating a value function is "prohibitively difficult" due to "high variance and unbalanced distribution of immediate reward for each state." In plain terms, ad rewards are extremely skewed -- most ad impressions generate zero revenue, a few generate a lot. The critic can't learn a stable baseline from this distribution.

The paper tested 6 configurations in offline experiments. The table below previews the action space designs we'll explain in detail shortly -- for now, focus on the Pass/Fail pattern:

| Config | Action Space | Exploration | Algorithm | Diversity | Relative Gain |
|---|---|---|---|---|---|
| 1 | Continuous $\mathbb{R}^n_+$ | Gaussian | Actor-Critic | Fail | Fail |
| 2 | Continuous $\mathbb{R}^n_+$ | Uniform | Actor-Critic | Fail | Fail |
| 3 | Discretized $m^n$ | Gaussian | Actor-Critic | Fail | Fail |
| 4 | Discretized $m^n$ | Uniform | Actor-Critic | Fail | Fail |
| 5 | Grouped $g^n$ | Uniform | Actor-Critic | Fail | Fail |
| 6 | Grouped $g^n$ | Uniform | REINFORCE | Pass | Pass |

Only one configuration worked: grouped discrete actions + uniform exploration + REINFORCE. Every Actor-Critic variant failed on both offline metrics.

DRL-PUT uses a modified, single-step version of REINFORCE with discount factor $\gamma = 0$. That makes it a completely myopic agent -- it only optimizes for the immediate reward of each ad request, not a trajectory of future interactions.

This might seem to contradict the series' premise that myopic optimization fails. The difference: in Part 1, the problem was being myopic about what to optimize (clicks instead of long-term engagement). DRL-PUT is myopic about when to optimize (this request only), but it optimizes a multi-objective reward that captures engagement, revenue, and user value simultaneously. For ad ranking, choosing the right objective matters more than planning multiple steps ahead -- and the paper notes that extending to $\gamma > 0$ with long-term rewards is future work.

The update rule is mini-batch averaged:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \eta \cdot \frac{1}{B} \sum_{i=1}^{B} r_t^{(i)} \cdot \nabla_{\boldsymbol{\theta}} \log \pi(a_t^{(i)} | s_t^{(i)}, \boldsymbol{\theta}_t)

This is vanilla REINFORCE (see Part 2 for the derivation) with two modifications: (1) each sample is a single-step interaction (since $\gamma = 0$, there are no multi-step trajectories), and (2) gradients are averaged over a mini-batch of $B$ ad requests rather than computed from one sample. The averaging reduces gradient variance -- important because individual ad impressions are noisy signals.

You might notice there's no importance sampling correction, even though the data is off-policy (collected by uniform random exploration). With $\gamma = 0$ and a single step, this is effectively a reward-weighted classification problem. And because the behavior policy is uniform, every action is equally likely to appear in the training data -- no action is systematically over- or under-represented. Combined with the single-step setting (no trajectory-level compounding of probabilities), this eliminates the distribution mismatch that importance sampling normally corrects for.

The result: the model learns to assign higher probability to actions that received higher reward. This differs from the trajectory-level off-policy correction in Part 2, where importance weights correct for the compounding effect of different action probabilities across multiple steps.
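In code, the single-step update above reduces to a reward-weighted log-likelihood loss over the mini-batch. A minimal PyTorch sketch, assuming `policy_net` is an `nn.Module` that maps state features to logits over the 1,000 grouped actions:

Python
import torch

def reinforce_step(policy_net, optimizer, states, actions, rewards):
    """One mini-batch update of single-step REINFORCE (gamma = 0, no importance weights)."""
    log_probs = torch.log_softmax(policy_net(states), dim=-1)        # [B, 1000]
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_i | s_i)

    # Maximizing (1/B) * sum_i r_i * log pi(a_i | s_i) == minimizing its negative
    loss = -(rewards * log_pi_a).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()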

#Action Space: Three Iterations to Get Right

With the algorithm chosen, the next challenge was designing a tractable action space. The design went through three iterations, and the story is worth telling because it illustrates a general lesson: start simple, discover what breaks, and constrain further.

Iteration 1 -- Continuous $\mathbb{R}^n_+$: Let the agent output continuous weight values. This failed because they couldn't get enough training data without a simulator. The model wouldn't converge when the number of training examples was "much smaller than the continuous action space."

Iteration 2 -- Discretized ($m^n$): Partition each weight into $m = 10$ equally spaced values within a predefined range $[w_{i,\min}, w_{i,\max}]$. With $n$ hyperparameters (easily double digits), this gives $10^n$ possible actions -- still too large for the configurations they tested (though the failures may also stem from Actor-Critic rather than action space size alone).

Iteration 3 -- Grouped discrete ($g^n$): Group semantically related weights together so they always share the same discretization index. For example, $p(\text{click})$ and $p(\text{click30})$ are physically related (both measure click engagement), so their weights move together. But $p(\text{conversion})$ is a different behavior, so it gets its own group. The three groups are:

  1. Click-related weights ($w_\text{click}$, $w_\text{click30}$)
  2. Conversion-related weights ($w_\text{conversion}$)
  3. Reserve price threshold ($b$)

With 3 groups and 10 values each: $10^3 = 1{,}000$ total actions. This is small enough for a softmax output layer and large enough to express meaningful variation.
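As an illustration of how compact this action space is to work with, here's a sketch that decodes an action index in [0, 999] back into concrete weight values. The grids and ranges are made-up examples, not the paper's values:

Python
import numpy as np

# Hypothetical 10-point grids for each of the three groups.
CLICK_GRID      = np.linspace(0.5, 2.0, 10)    # shared by w_click and w_click30
CONVERSION_GRID = np.linspace(0.5, 2.0, 10)    # w_conversion
RESERVE_GRID    = np.linspace(0.01, 0.10, 10)  # reserve price b

def decode_action(action_index):
    """Map an index in [0, 999] to the grouped weight values it represents."""
    i_click, rest = divmod(action_index, 100)
    i_conv, i_reserve = divmod(rest, 10)
    return {
        "w_click": CLICK_GRID[i_click],
        "w_click30": CLICK_GRID[i_click],  # grouped: shares the click index
        "w_conversion": CONVERSION_GRID[i_conv],
        "b": RESERVE_GRID[i_reserve],
    }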

#State Representation

The action space defines what the agent can do. The state defines what it sees. The state captures three categories of features when an ad request arrives:

  • User profile: age, gender, metro area
  • User activities: historical action counts, both on-site (clicked items) and off-site (purchased items)
  • Contextual: hour of day, day of week, IP country

Notably, the paper excludes real-time supply/demand information (total ad budget in inventory, current user traffic) even though it would help. The reason is practical: "such information is either hard to measure or infeasible to be obtained during serving time." This is a common production constraint -- features that would be useful in theory but impossible to compute at the latency required for real-time ad serving.

#Network Architecture

Given this state representation, the policy model is straightforward: an MLP that maps states to a probability distribution over 1,000 actions:

  1. Each categorical feature passes through its own embedding layer
  2. Embeddings are concatenated with numerical and dense vector features
  3. The combined representation passes through hidden layers: batch normalization -> linear -> ReLU
  4. A final softmax layer outputs $\pi(a|s, \boldsymbol{\theta})$ over all 1,000 discretized actions
  5. At serving time: $a^* = \arg\max_{a \in \mathcal{A}} \pi(a|s, \boldsymbol{\theta})$

One forward pass per ad request. No value function to evaluate, no search over actions. This is what makes REINFORCE practical here -- the policy network directly outputs action probabilities, and you just take the argmax.
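A minimal sketch of such a policy network in PyTorch -- the embedding sizes, hidden width, and feature split are assumptions for illustration, not the paper's exact configuration:

Python
import torch
import torch.nn as nn

class UtilityTuningPolicy(nn.Module):
    def __init__(self, categorical_cardinalities, num_dense,
                 num_actions=1000, emb_dim=16, hidden_dim=256):
        super().__init__()
        # One embedding table per categorical feature (age bucket, gender, metro, ...)
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in categorical_cardinalities]
        )
        input_dim = emb_dim * len(categorical_cardinalities) + num_dense
        self.trunk = nn.Sequential(
            nn.BatchNorm1d(input_dim),
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, categorical_feats, dense_feats):
        # categorical_feats: [batch, num_categorical] int64; dense_feats: [batch, num_dense]
        embs = [emb(categorical_feats[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embs + [dense_feats], dim=-1)
        return torch.softmax(self.trunk(x), dim=-1)  # pi(a | s, theta)

# Serving: action = policy(cat_feats, dense_feats).argmax(dim=-1)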

#Reward Design: Campaign-Type-Specific

The reward design required more iteration than the architecture. Don't confuse the ranking utility (which scores each ad to determine display order) with the reward (which tells the RL agent how good its chosen weights were). The utility is what the agent outputs -- the reward is the feedback it receives. They share similar components but serve different roles.

A quick primer on ad campaign types, since the reward depends on them: when advertisers create a campaign, they choose what they're willing to pay for. A click-through campaign pays per click (CPC), a conversion campaign pays per purchase or sign-up (CPA), and an impression campaign pays per ad view (CPM). This choice shapes the entire optimization objective -- the platform needs to maximize value differently for each type.

The reward has two components (note: these use predicted probabilities from existing ranking models, not observed user actions -- the reward is computed at serving time before the ad is shown):

r = \text{Estimated\_Revenue} + \text{Estimated\_User\_Value}

Estimated Revenue varies by campaign type:

\text{Estimated\_Revenue} = \begin{cases} p(\text{click}) \cdot \text{bid}_\text{ctr} & \text{click-through campaign} \\ p(\text{conversion}) \cdot \text{bid}_\text{conv} & \text{conversion campaign} \\ \text{bid}_\text{imp} & \text{impression campaign} \end{cases}

Estimated User Value is a weighted combination of engagement probabilities:

\text{Estimated\_User\_Value} = \alpha \cdot p(\text{click}) + \beta \cdot p(\text{click30}) + \gamma \cdot p(\text{conversion})

where $\alpha, \beta, \gamma$ differ by campaign type (note: this $\gamma$ is a reward coefficient, not the discount factor from the REINFORCE equation above):

| Campaign Type | $\alpha$ (click) | $\beta$ (click30) | $\gamma$ (conversion) |
|---|---|---|---|
| Click-through | 1.0 | 0.5 | 0.0 |
| Conversion | 0.1 | 0.4 | 0.5 |
| Impression | 0.0 | 0.0 | 0.0 |

The logic: click-through campaigns care about clicks (obviously), conversion campaigns weight conversions heavily but still want some click signal, and impression campaigns only optimize revenue (the advertiser pays per impression regardless of user behavior).

One more detail that matters in practice: min-max batch normalization on the engagement probabilities ($p(\text{click})$, $p(\text{click30})$, $p(\text{conversion})$) "significantly improves the stability and convergence speed of training." For each mini-batch of $B$ samples, each probability is normalized to $[0, 1]$ by subtracting the batch minimum and dividing by the range. Without this, the different probability scales cause training instability.
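Putting the pieces together, here's a sketch of the per-batch reward computation, including the min-max normalization of the engagement probabilities. Field names and the exact placement of the normalization are my assumptions:

Python
import torch

# (alpha, beta, gamma) per campaign type, from the table above.
USER_VALUE_COEFFS = {
    "click_through": (1.0, 0.5, 0.0),
    "conversion":    (0.1, 0.4, 0.5),
    "impression":    (0.0, 0.0, 0.0),
}

def minmax_normalize(p):
    """Normalize a batch of probabilities to [0, 1] within the mini-batch."""
    return (p - p.min()) / (p.max() - p.min() + 1e-8)

def compute_rewards(batch):
    """batch: dict of 1-D tensors (plus a list of campaign-type strings) for B ad requests."""
    p_click   = minmax_normalize(batch["p_click"])
    p_click30 = minmax_normalize(batch["p_click30"])
    p_conv    = minmax_normalize(batch["p_conversion"])

    rewards = []
    for i, ctype in enumerate(batch["campaign_type"]):
        if ctype == "click_through":
            revenue = batch["p_click"][i] * batch["bid"][i]
        elif ctype == "conversion":
            revenue = batch["p_conversion"][i] * batch["bid"][i]
        else:  # impression campaign: paid per view
            revenue = batch["bid"][i]
        alpha, beta, gamma = USER_VALUE_COEFFS[ctype]
        user_value = alpha * p_click[i] + beta * p_click30[i] + gamma * p_conv[i]
        rewards.append(revenue + user_value)
    return torch.stack(rewards)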

#Exploration and Data Collection

With the reward function defined, the remaining question is where the training data comes from. DRL-PUT is trained off-policy. Only 0.5% of production traffic is reserved for exploration -- the rest uses the production ranking weights. On that exploration slice, actions are sampled from a uniform distribution over all 1,000 discretized actions.

They tried Gaussian exploration first (centered on the current production weights, so most samples stay close to the status quo). It failed due to "insufficient exploration" -- too many sampled actions clustered near the production values, leaving the agent with almost no data about actions far from the current policy. Uniform random solved this: every action gets roughly equal coverage across the exploration traffic.

The training loop is continuous: (state, action, reward) triplets stream in from online serving logs, and the model trains on them in real time. This is off-policy because the behavior policy (uniform random on the exploration slice) differs from the learned policy.
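A sketch of what the serving-time split looks like (the 0.5% figure is from the paper; the function and logging details are illustrative):

Python
import random

EXPLORATION_RATE = 0.005   # 0.5% of ad requests
NUM_ACTIONS = 1000

def choose_action(production_action):
    """Return (action, is_exploration). Only exploration requests feed RL training."""
    if random.random() < EXPLORATION_RATE:
        return random.randrange(NUM_ACTIONS), True   # uniform over all 1,000 actions
    return production_action, False                   # status-quo weights for everyone else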

#Safety Mechanisms

Uniform random exploration over ad weights sounds risky -- bad weight combinations could surface irrelevant ads. Deploying RL in an ad system requires multiple safety layers:

  • Bounded action space: All weights and the reserve price are constrained to predefined ranges. The agent can't set $w_\text{click} = 10{,}000$.
  • Reserve price filtering: The indicator $\mathbf{1}(\text{Estimated\_Revenue} \geq b)$ hard-filters low-revenue ads before ranking, preventing the agent from surfacing ads that generate negligible revenue.
  • Minimal exploration traffic: 0.5% of requests. If the exploration policy performs poorly, 99.5% of users are unaffected.
  • Offline evaluation gates: Before a trained model goes to production, it must pass two offline metrics: Diversity (action diversity -- does the model produce varied actions for different states, or has it collapsed to always choosing the same action?) and Relative_Gain (does the candidate policy outperform the behavior policy in reward-weighted expectation?). See the sketch after this list for one way to compute them.
  • Decoupled from ranking models: The RL agent only adjusts weights on top of existing model outputs. It never modifies the click or conversion prediction models themselves. If the RL agent fails, you revert to static weights -- nothing else is affected.
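The paper doesn't spell out the exact formulas for the two offline gates, so here is one plausible reading as a sketch: Diversity checks that greedy actions vary across states, and Relative_Gain is an importance-weighted comparison of the candidate policy against the uniform behavior policy. The metric definitions below are assumptions:

Python
import torch

def diversity(policy_net, states):
    """Fraction of distinct greedy actions across a sample of states (collapse check)."""
    greedy = policy_net(states).argmax(dim=-1)
    return greedy.unique().numel() / states.shape[0]

def relative_gain(policy_net, states, actions, rewards, behavior_prob=1.0 / 1000):
    """Importance-weighted value of the candidate policy vs. the uniform behavior policy."""
    pi = torch.softmax(policy_net(states), dim=-1)   # policy_net is assumed to return logits
    pi_a = pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    candidate_value = (pi_a / behavior_prob * rewards).mean()
    behavior_value = rewards.mean()
    return ((candidate_value - behavior_value) / behavior_value).item()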

#Personalization Evidence

One of the more interesting results from the paper: the model genuinely personalizes. When users are bucketed by historical CTR, the model increases $w_\text{click}$ for high-CTR users while decreasing both $w_\text{conversion}$ and the reserve price $b$. It learned to show these users more engaging content and apply a lower revenue bar. For high-CVR users, the reverse: a higher reserve price (these users are more valuable to conversion-focused advertisers) and a higher conversion weight.

A smart product manager would make the same trade-off -- but the model does it per-request rather than per-quarter.

#Production Results

With these design choices in place, here's what Pinterest measured. They deployed DRL-PUT in an online A/B experiment. The "treated segment" is the user population whose ads were ranked using the DRL-PUT model; "platform-wide" includes both treated and control users:

| Metric | Platform-wide | Treated Segment |
|---|---|---|
| Revenue | +0.27% | -0.16%* |
| Impressions | +0.02%* | -0.08%* |
| CTR | +1.62% | +9.71% |
| CTR30 (long-click) | +1.03% | +7.73% |
| CVR | +0.67% | +1.26% |

* = statistically insignificant

The headline numbers: +9.7% CTR and +7.7% long-click rate on the treated segment, with a statistically insignificant -0.16% revenue change (indistinguishable from zero). The paper notes that "0.5% increase in CTR is considered as a substantial gain" at Pinterest's scale, making the ~10% improvement especially notable.

The revenue result is a predictable consequence of the reward design: the agent learned to favor engaging ads over high-bid-but-boring ones, which is exactly what a reward combining revenue and engagement would produce. Platform-wide (aggregated across all traffic), all metrics improved.

#RecoMind: Simulator-Based RL

Paper: RecoMind: Recommender System Simulation for RL-Based Optimization

Here's the problem with RL in production: you can't experiment freely with real users. A bad policy means degraded user experience, and you don't know it's bad until users suffer.

RecoMind's solution: build a simulator from your existing supervised models, then train RL safely in simulation.

#Simulator Architecture

[Figure: RecoMind pipeline. User interaction logs (state, action, reward) from the production system train supervised user models -- P(click), P(watch), P(skip) -- that form the simulated environment and reward signal. The RL policy π_θ(a|s) is trained against this simulator with ε-greedy plus softmax exploration and a TD loss (r + γQ' - Q)², then deployed back to production when ready.]

The simulator predicts how users would respond to any recommendation. It's trained on historical data to mimic real user behavior.

#Bootstrapping from Production

RecoMind doesn't start from scratch. The initial RL policy is the production ranking model. Training then improves on this baseline.

Quick primer on Q-learning (if you're coming from REINFORCE in Part 2): Unlike REINFORCE, which learns a policy directly, Q-learning learns a value function $Q(s,a)$ that predicts total future reward. We then act greedily: pick the action with the highest $Q$. The "target network" is a slowly-updated copy of $Q$ used to stabilize training -- without it, we'd be chasing a moving target.

Python
import copy
import torch
import torch.nn.functional as F

def soft_update(target, source, tau):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.data.mul_(1 - tau).add_(tau * s_param.data)

# Simplified RecoMind training loop
def train_recomind(simulator, production_policy, num_iterations, gamma=0.75):
    """
    Train RL policy using simulator.

    Args:
        simulator: User behavior simulator
        production_policy: Current production model to bootstrap from
        num_iterations: Number of training iterations
        gamma: Discount factor (0.75 = ~4 step effective horizon, appropriate for
               in-session optimization where sessions are short)
    """
    # Initialize RL policy from production
    rl_policy = copy.deepcopy(production_policy)
    target_network = copy.deepcopy(rl_policy)
    optimizer = torch.optim.Adam(rl_policy.parameters(), lr=1e-4)

    for iteration in range(num_iterations):
        # Sample trajectories in simulator
        states, actions, rewards, next_states = simulator.rollout(rl_policy)

        # TD loss for Q-learning
        with torch.no_grad():
            next_q = target_network(next_states).max(dim=1).values
            # Note: In production, multiply by (1 - done) to zero out bootstrapping at episode ends
            td_target = rewards + gamma * next_q  # Discounted future value

        current_q = rl_policy(states).gather(1, actions.unsqueeze(1))
        loss = F.mse_loss(current_q.squeeze(), td_target)

        # Update policy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Soft update target network
        soft_update(target_network, rl_policy, tau=0.001)

    return rl_policy

#Exploration Strategy

RecoMind uses a hybrid exploration strategy:

  1. ε-greedy: With probability ε, take a random action
  2. Softmax over top-K: Instead of purely random, sample from a softmax over the top-K actions

This balances exploration (finding new good recommendations) with staying close to known-good options (not deviating too far from production).
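A sketch of this hybrid scheme, with illustrative parameter values:

Python
import numpy as np

def select_item(scores, epsilon=0.1, k=50, temperature=1.0):
    """scores: [num_items] array of Q-values; explore via softmax over the top-K items."""
    if np.random.random() > epsilon:
        return int(np.argmax(scores))             # exploit: greedy action

    top_k = np.argsort(scores)[-k:]               # explore, but only among the top-K
    logits = scores[top_k] / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(top_k, p=probs))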

#Simulator Validation

A simulator is only useful if it's accurate. RecoMind validates through:

  1. Offline simulation testing: Run 100,000 episodes with the trained policy and measure predicted metrics
  2. Online A/B testing: Deploy to a held-out population (40M users in their case) and verify simulation predictions match online performance
  3. Metric correlation: Ensure offline metric improvements correlate with online improvements

#Results

Pinterest deployed RecoMind-trained policies:

  • +15.81% long watch time
  • +4.71% session depth (users view more content)

Simulation enabled testing without risking user experience.

#Trade-offs Comparison

| Aspect | DRL-PUT | RecoMind |
|---|---|---|
| What it learns | Ranking weights | Full ranking policy |
| Action space | Discretized weights | Item selection |
| Training data | Online | Simulated |
| Safety | Bounded actions | Simulation isolation |
| Complexity | Moderate | High (requires simulator) |

Pinterest chose two complementary approaches for different problems. But your system isn't Pinterest. Let's build a framework for choosing your own path.


#Decision Framework

After covering seven papers, how do you choose the right approach for your system?

Which RL approach should you use? Work through a few questions about your recommendation system -- starting with your primary optimization objective -- and map the answers to the papers covered in this series using the table below.

#Summary Table

| Objective | Recommended Approach | Paper |
|---|---|---|
| Immediate engagement, large catalog | REINFORCE + Top-K | YouTube 2019 |
| Need stable training | Actor-Critic | Google 2022 |
| Offline policy evaluation | Control Variates | Netflix 2021 |
| Long-term retention | RLUR dual-critic | Kuaishou 2023 |
| Cold-start / new content | Impatient Bandits | Spotify 2023 |
| Multi-objective balance | DRL-PUT | Pinterest 2025 |
| Fast safe iteration | Simulator RL | Pinterest 2025 |

#What This Series Taught Us

Looking back across all four parts, the core lesson is this: successful RL for recommendations is about managing the gap between textbook RL and production reality.

Textbook RL assumes: online learning, immediate rewards, single-step actions, stable environments. Production has: logged data, delayed signals, slates of items, evolving user preferences. Every paper we covered addresses one of these gaps:

  • YouTube REINFORCE: Handles massive action spaces and off-policy data
  • Google Actor-Critic: Reduces variance for stable training
  • Netflix Slate OPE: Evaluates slate policies offline
  • Kuaishou RLUR: Bridges immediate and retention rewards
  • Spotify Impatient Bandits: Uses partial signals for faster learning
  • Pinterest DRL-PUT: Dynamically balances multiple objectives
  • Pinterest RecoMind: Enables safe offline iteration via simulation

The successful papers we covered share a pattern: they explicitly identify which RL assumptions their problem violates, then design around those gaps.

#Starting Recommendation

If you're building RL for recommendations today, here's my recommended approach:

  1. Build a simulator first (RecoMind approach)

    • Use your existing supervised models as the foundation
    • Validate simulator accuracy carefully
    • This enables fast, safe experimentation
  2. Bootstrap from production

    • Start with your current ranking model as the initial policy
    • RL then learns to improve on this baseline
  3. Use DRL-PUT for multi-objective tuning

    • If you have competing objectives (engagement, revenue, fairness)
    • Learn to dynamically balance rather than static weights
  4. Add RLUR-style retention optimization

    • Once engagement is solid, add retention objective
    • Dual critics provide learning signal at both timescales
    • Note: This requires substantial infrastructure (retention signal takes days to materialize, needs user-group normalization)
  5. Deploy Impatient Bandits for cold-start

    • New content needs exploration
    • Bayesian updating accelerates quality discovery

#Implementation Roadmap

#Phase 1: Contextual Bandits Foundation

Before jumping to full RL, validate that learning-based decisions beat heuristics. Start with contextual bandits—they're simpler and faster to debug.

Why start here? If your recommendation problem doesn't benefit from personalized exploration (bandits), it won't benefit from sequential optimization (full RL). Bandits are your sanity check. They answer: "Does intelligent exploration find better items than our current heuristics?" If no, stop here—something is wrong with your reward signal or action space.

What to expect: A well-tuned LinUCB should beat random exploration within a few thousand interactions and approach your best static policy within 10-50K interactions (depending on feature quality). If you're not seeing this, debug before proceeding.

LinUCB intuition: LinUCB predicts rewards as a linear function of features, and tracks uncertainty about those predictions. Think of it this way: the matrix $A$ accumulates information about which features you've seen. Early on, $A$ is small (little information), so $A^{-1}$ is large (high uncertainty). As you observe more data, $A$ grows and $A^{-1}$ shrinks -- you become more confident.

The estimate $\theta = A^{-1}b$ gives your best guess of reward weights. The term $\sqrt{x^T A^{-1} x}$ measures how uncertain you are for this specific context. LinUCB adds the uncertainty to the estimate: items you're unsure about get a bonus. This encourages exploration -- you try uncertain items to learn more about them.

Python
import numpy as np

class LinUCB:
    """
    Linear Upper Confidence Bound bandit for personalized ranking.
    """
    def __init__(self, d, alpha=1.0):
        self.d = d  # Feature dimension
        self.alpha = alpha  # Exploration parameter
        self.A = {}  # Per-arm covariance matrices
        self.b = {}  # Per-arm reward vectors

    def get_arm_params(self, arm):
        if arm not in self.A:
            self.A[arm] = np.eye(self.d)
            self.b[arm] = np.zeros(self.d)
        return self.A[arm], self.b[arm]

    def select_arm(self, context, arms):
        """
        Select arm with highest UCB score.

        Args:
            context: [d] feature vector
            arms: list of available arms

        Returns:
            Selected arm
        """
        best_arm = None
        best_score = float('-inf')

        for arm in arms:
            A, b = self.get_arm_params(arm)
            A_inv = np.linalg.inv(A)

            # Estimated reward
            theta = A_inv @ b
            mean = context @ theta

            # Confidence bound
            std = np.sqrt(context @ A_inv @ context)

            # UCB score
            score = mean + self.alpha * std

            if score > best_score:
                best_score = score
                best_arm = arm

        return best_arm

    # Note: The matrix inverse inside select_arm is O(d^3) per arm.
    # For production, use Sherman-Morrison incremental updates or solve
    # the linear system directly. This code is for pedagogy.

    def update(self, arm, context, reward):
        """Update arm parameters with observed reward.

        Key insight: A accumulates the "information matrix" (sum of x*x^T).
        When we see context x and reward r:
        - A grows by x*x^T (we've seen this direction in feature space)
        - b grows by r*x (we've associated this direction with reward r)

        Then theta = A^{-1}b is the least-squares estimate of reward weights.
        The more observations in a direction, the larger A is in that direction,
        so A^{-1} is smaller -> lower uncertainty -> smaller exploration bonus.
        """
        A, b = self.get_arm_params(arm)
        self.A[arm] = A + np.outer(context, context)  # Information accumulation
        self.b[arm] = b + reward * context  # Reward-weighted direction

Metrics to track:

  • Regret compared to best-arm-in-hindsight
  • Exploration rate (how often do you recommend uncertain items?)
  • Business metrics (engagement, revenue)

#Phase 2: Add REINFORCE

Once bandits show value—meaning they beat your heuristic baseline on a held-out A/B test—graduate to policy gradient methods.

When to make this jump: You should move to REINFORCE when you have evidence that sequence matters. Signs include: (1) bandit performance varies based on what was recently shown, (2) session-level metrics don't match single-decision metrics, (3) users who see diverse content early in sessions engage more later. If decisions are truly independent, stick with bandits—they're simpler and often better.

Prerequisites check: Before implementing REINFORCE, ensure you have: (1) behavior policy probabilities logged for every recommendation, (2) a clear reward signal that captures session-level or long-term value, (3) infrastructure to train on sequences of (state, action, reward) tuples rather than individual examples.

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

class REINFORCERecommender(nn.Module):
    """
    REINFORCE recommender with importance sampling and Top-K correction.
    See Part 2 for detailed explanation of these components.
    """
    def __init__(self, state_dim, item_dim, hidden_dim=256):
        super().__init__()
        self.user_encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        self.item_encoder = nn.Sequential(
            nn.Linear(item_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

    def compute_policy(self, user_state, items):
        """
        Compute softmax policy over items.

        Args:
            user_state: [batch, state_dim]
            items: [num_items, item_dim]

        Returns:
            policy: [batch, num_items] probabilities
        """
        user_emb = self.user_encoder(user_state)  # [batch, hidden]
        item_emb = self.item_encoder(items)        # [num_items, hidden]

        # Dot product scores
        scores = user_emb @ item_emb.T  # [batch, num_items]

        # Softmax policy
        policy = F.softmax(scores, dim=-1)
        return policy

    def reinforce_loss(self, user_states, items, actions, rewards,
                       behavior_probs, k=16, cap=10.0):
        """
        REINFORCE loss with importance sampling and Top-K correction.
        """
        # Current policy probabilities
        policy = self.compute_policy(user_states, items)
        pi = policy.gather(1, actions.unsqueeze(1)).squeeze()

        # Importance weights (capped)
        weights = (pi / behavior_probs).clamp(max=cap)

        # Top-K correction
        lambda_k = k * (1 - pi) ** (k - 1)

        # Log probability
        log_pi = torch.log(pi + 1e-10)

        # REINFORCE gradient
        loss = -(weights * lambda_k * rewards * log_pi).mean()

        return loss

Key additions over bandits:

  • Policy is a neural network (richer representations)
  • Off-policy correction enables learning from logged data
  • Top-K correction for slate recommendations

#Phase 3: Build Simulator

This is where iteration speed increases dramatically. Instead of waiting weeks for A/B test results, you can test policy changes in minutes.

Python
import numpy as np

class UserSimulator:
    def __init__(self, click_model, watch_model, session_model):
        self.click_model = click_model      # P(click | user, item)
        self.watch_model = watch_model      # E[watch_time | user, item, click]
        self.session_model = session_model  # P(continue | user, history)

    def simulate_session(self, user_state, policy, max_steps=20):
        """
        Simulate a user session under a given policy.

        Returns:
            trajectory: list of (state, action, reward, next_state)
        """
        trajectory = []
        state = user_state

        for step in range(max_steps):
            # Policy selects item
            items = self.get_candidate_items(state)
            action = policy.select(state, items)

            # Simulate user response
            click_prob = self.click_model(state, action)
            clicked = np.random.random() < click_prob

            if clicked:
                watch_time = self.watch_model(state, action)
            else:
                watch_time = 0

            reward = watch_time  # Or other reward function

            # State transition
            next_state = self.transition(state, action, clicked, watch_time)

            trajectory.append((state, action, reward, next_state))

            # Check if session continues
            continue_prob = self.session_model(next_state)
            if np.random.random() > continue_prob:
                break

            state = next_state

        return trajectory

Validation is critical:

  • Compare simulated vs. actual session statistics
  • Check that policy rankings are preserved

#Phase 4: Add Long-Term Objectives

Following RLUR's approach:

Python
import torch.nn.functional as F

# PolicyNetwork and ValueNetwork: standard actor and critic MLPs, assumed defined elsewhere
class DualCriticAgent:
    def __init__(self, state_dim, action_dim):
        self.actor = PolicyNetwork(state_dim, action_dim)
        self.critic_immediate = ValueNetwork(state_dim)
        self.critic_retention = ValueNetwork(state_dim)

    def compute_loss(self, batch, alpha=0.3):
        """
        Dual-critic loss combining immediate and retention signals.

        Args:
            batch: (states, actions, imm_rewards, ret_rewards, next_states)
            alpha: weight on retention objective (0.3 is typical; can increase over training)
        """
        states, actions, imm_rewards, ret_rewards, next_states = batch

        # Immediate critic
        v_imm = self.critic_immediate(states)
        v_imm_next = self.critic_immediate(next_states).detach()
        td_imm = imm_rewards + 0.99 * v_imm_next
        loss_imm = F.mse_loss(v_imm, td_imm)

        # Retention critic (longer horizon, sparser updates)
        v_ret = self.critic_retention(states)
        v_ret_next = self.critic_retention(next_states).detach()
        td_ret = ret_rewards + 0.999 * v_ret_next  # Higher gamma for long-term
        loss_ret = F.mse_loss(v_ret, td_ret)

        # Combined advantage for actor
        adv_imm = (td_imm - v_imm).detach()
        adv_ret = (td_ret - v_ret).detach()
        advantage = (1 - alpha) * adv_imm + alpha * adv_ret

        # Actor loss
        log_probs = self.actor.log_prob(states, actions)
        loss_actor = -(advantage * log_probs).mean()

        return loss_actor + loss_imm + loss_ret

#When RL Is the Wrong Tool

Before discussing mistakes, a reality check: RL doesn't always help. Skip RL entirely when:

  • Decisions are truly independent: If showing item A today doesn't affect what users want tomorrow, bandits are sufficient.
  • Reward signals are clean and immediate: If you have reliable, instant feedback, simpler methods often win.
  • Your action space is small and enumerable: With fewer than 100 items, exhaustive exploration is feasible.
  • You lack behavior policy logging: Without $\beta(a|s)$, off-policy RL is impossible.

The papers in this series succeeded because their problems genuinely needed sequential optimization. Don't assume yours does—validate with simpler methods first.

Bottom line: If you're unsure, start with a bandit pilot. If bandits don't beat your baseline, RL won't either.

#Common Mistakes (Learn from Others' Pain)

  1. Starting with full RL before validating with bandits: If bandits don't beat heuristics, full RL won't either. Bandits are your sanity check. I've seen teams spend 6 months building a REINFORCE pipeline only to discover their reward signal was too noisy to learn anything. Bandits would have revealed this in 2 weeks.

  2. Optimizing a proxy reward: Watch time is not retention. Clicks are not satisfaction. Make sure your reward signal actually tracks what you care about.

  3. Ignoring exploration in evaluation: Your A/B test metrics will underestimate RL gains because the logging policy didn't explore. Use OPE to estimate what exploration would have found.

  4. Deploying without guardrails: RL policies can find weird solutions. Set hard constraints on diversity, ad load, and content quality before deployment.

  5. Not logging behavior policy probabilities: You can't do off-policy correction without $\beta(a|s)$. Log it at serving time. This is non-negotiable.

#Troubleshooting Guide

Problem: Training loss is unstable (oscillating or diverging)

  • Check importance weights: if they're frequently hitting the cap, lower $c$ or investigate why $\pi$ and $\beta$ differ so much (the diagnostic sketch after this list computes these statistics)
  • Check for NaN in log-probabilities: ensure you're using log_softmax, not log(softmax(...))
  • Reduce learning rate by 10x and see if it stabilizes
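A quick diagnostic sketch for the importance-weight check above (function and key names are illustrative):

Python
import torch

def importance_weight_stats(pi_probs, behavior_probs, cap=10.0):
    """How extreme are the raw importance weights before capping?"""
    raw = pi_probs / behavior_probs
    return {
        "frac_at_cap": (raw >= cap).float().mean().item(),
        "mean_raw": raw.mean().item(),
        "p99_raw": torch.quantile(raw, 0.99).item(),
    }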

Problem: Policy collapses to always recommending the same items

  • Add entropy regularization to the actor loss: -entropy_weight * policy.entropy().mean()
  • Check if your reward signal has enough variance—constant rewards give no learning signal
  • Verify exploration is happening: log the entropy of your policy distribution

Problem: Offline metrics look good but online A/B test shows no improvement

  • Your simulator or OPE may have distribution shift—retrain on more recent data
  • Check that you're logging $\beta(a|s)$ correctly (a common bug is logging raw pre-softmax logits instead of post-softmax probabilities)
  • Verify the action space matches: if production has more/different items, your policy may be misaligned

Problem: Actor-Critic critic loss doesn't decrease

  • Target network might be updating too fast -- decrease $\tau$ to 0.001
  • Reward scale might be wrong—normalize rewards by their standard deviation
  • Check for reward delay: if rewards arrive after state transitions, your TD targets are wrong

#Conclusion

We started with a frustrating problem: your CTR model gets better, but nothing that matters improves. Now you have a toolkit to fix it.

Supervised learning predicts; RL decides. When you care about sequences, long-term outcomes, or exploration, prediction alone isn't enough.

But "use RL" is too vague. Now you know: start with a simulator (so you can iterate safely), bootstrap from your production model (so you're improving on a known baseline), and graduate from bandits to REINFORCE to actor-critic as your problem demands.

YouTube reported +0.85% watch time, Pinterest +15.81% long watch time, and Kuaishou statistically significant DAU improvements (which they note translates to ~600K additional daily active users at their scale). These are paper-reported results from production deployments; your mileage will vary.

Remember that flat retention despite better CTR from Part 1? Now you know how to fix it: prediction got you here, but decisions will get you further.

Here's your Monday morning action: Look at your logging infrastructure. Are you recording the probability $\beta(a|s)$ that your current system assigned to each recommendation? If not, you can't do off-policy learning, and everything in this series is theoretical. Fix that first. Then build a user simulator using your existing click and watch-time prediction models. Run 1,000 simulated sessions. That's your sandbox for RL experiments.

Your move.

#Further Reading

Foundational Papers:

Industry Papers Covered:


#Bonus: Building Your Own Simulator

If you're serious about RL for recommendations, building a simulator pays off quickly. Pinterest's RecoMind shows the payoff: simulation let them test policy changes in minutes instead of waiting weeks for A/B results.

#Why Simulators?

  1. Fast iteration: Run thousands of experiments in hours, not months
  2. Safe exploration: Test risky policies without affecting real users
  3. Counterfactual analysis: Answer "what if" questions about past decisions
  4. Debug and interpret: Understand why policies behave certain ways

#Key Components

1. User Response Models

Train supervised models to predict user behavior:

Python
# Click model: P(click | user, item, context)
click_model = train_binary_classifier(
    features=['user_history', 'item_features', 'context'],
    target='clicked'
)

# Watch time model: E[watch_time | user, item, clicked=True]
watch_model = train_regressor(
    features=['user_history', 'item_features', 'context'],
    target='watch_time',
    filter_condition='clicked == True'
)

2. State Transition Model

How does user state evolve after each interaction?

Python
def transition(user_state, action, response):
    """
    Update user state based on interaction.
    """
    new_state = user_state.copy()

    # Add interaction to history
    new_state['history'].append({
        'item': action,
        'clicked': response['clicked'],
        'watch_time': response['watch_time']
    })

    # Update interest vectors (simplified)
    if response['clicked']:
        item_embedding = get_embedding(action)
        new_state['interest'] = (
            0.9 * new_state['interest'] +
            0.1 * item_embedding
        )

    return new_state

3. Validation Suite

The simulator is only useful if it's accurate:

Python
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def validate_simulator(simulator, policy, holdout_data):
    """
    Validate simulator against held-out real data.

    Args:
        simulator: trained UserSimulator
        policy: policy used to roll out simulated sessions (e.g. the production policy)
        holdout_data: list of real sessions, each a list of (state, action, response)
    """
    metrics = {}

    # 1. Response prediction accuracy
    predicted_clicks = []
    actual_clicks = []
    for session in holdout_data:
        for state, action, response in session:
            pred = simulator.click_model(state, action)
            predicted_clicks.append(pred)
            actual_clicks.append(response['clicked'])

    metrics['click_auc'] = roc_auc_score(actual_clicks, predicted_clicks)

    # 2. Session length distribution
    # ks_2samp: Kolmogorov-Smirnov test comparing two distributions
    # Returns a statistic between 0-1; lower = more similar distributions
    holdout_states = [session[0][0] for session in holdout_data]  # initial user states
    simulated_sessions = [
        simulator.simulate_session(state, policy) for state in holdout_states
    ]
    sim_lengths = [len(s) for s in simulated_sessions]
    real_lengths = [len(s) for s in holdout_data]
    metrics['length_ks'] = ks_2samp(sim_lengths, real_lengths).statistic

    # 3. Engagement distribution
    # Simulated trajectories store (state, action, reward, next_state); reward is watch time.
    sim_engagement = [sum(step[2] for step in s) for s in simulated_sessions]
    real_engagement = [
        sum(response['watch_time'] for _, _, response in s) for s in holdout_data
    ]
    metrics['engagement_ks'] = ks_2samp(sim_engagement, real_engagement).statistic

    return metrics

#Common Pitfalls

Distribution shift: The simulator learns from behavior policy data. When evaluating a very different policy, predictions may be unreliable. Solution: Use uncertainty estimates and stay close to the behavior distribution.

Missing confounders: Some factors affecting user behavior aren't in your features (mood, external distractions). Solution: Model aleatoric uncertainty (irreducible randomness in user behavior—some variance can't be explained no matter how good your model) and validate on diverse data.

Reward hacking: RL can find policies that exploit simulator bugs rather than genuinely improving recommendations. Solution: Regularly validate against A/B tests.


| Paper | Company | Year | Objective | Approach | Results |
|---|---|---|---|---|---|
| YouTube REINFORCE | YouTube | 2019 | Watch time | Policy Gradient + Top-K | +0.85% watch time |
| Actor-Critic Rec | Google | 2022 | User enjoyment | Actor-Critic + Value function | +0.07% (on top of REINFORCE) |
| Slate OPE | Netflix | 2021 | Offline evaluation | Control Variates | O(K) variance reduction |
| RLUR | Kuaishou | 2023 | User retention | Dual-Critic RL | +0.2% DAU |
| Impatient Bandits | Spotify | 2023 | Cold-start | Bayesian + Meta-learning | Near-optimal performance |
| DRL-PUT | Pinterest | 2025 | Multi-objective | Utility tuning RL | +9.7% CTR |
| RecoMind | Pinterest | 2025 | Session optimization | Simulator-based RL | +15.81% long watch |

Series Overview:

  • Part 1: RL Foundations
  • Part 2: Production Policy Gradients (YouTube, Google)
  • Part 3: Advanced Topics (Netflix, Kuaishou, Spotify)
  • Part 4 (this article): Implementation Guide (Pinterest, Decision Framework)