LLM Serving from Scratch: The Systems Behind Fast Inference
How LLMs are efficiently served in production — from KV cache management and PagedAttention to speculative decoding and prefill-decode disaggregation.