You just deployed a 70B parameter LLM on 8 H100 GPUs. The model weights consume 140 GB of memory. You start serving requests and throughput looks great -- until you hit 20 concurrent users with 4K context windows, and suddenly your GPUs are out of memory. But wait: 20 users times 4K tokens is only 80K tokens. How can that consume more memory than the 70 billion parameter model itself?
The answer is the KV cache. And understanding it is the key to understanding everything about LLM serving.
I wrote this article because I kept running into the same pattern: teams deploy an LLM, hit a mysterious memory wall, and then scramble to figure out why their $200K GPU cluster can only handle a dozen users. The serving stack has matured enormously in the past three years -- from Orca's continuous batching in 2022 to Mooncake's disaggregated architecture winning Best Paper at FAST 2025 -- but most of the knowledge lives scattered across academic papers and framework docs. This article brings it together.
What you'll learn:
Prerequisites: Familiarity with the transformer attention mechanism, autoregressive text generation (token-by-token), and basic GPU concepts (what HBM (High Bandwidth Memory) is, why memory bandwidth matters). Comfortable reading Python pseudocode.
Skip-ahead guide: If you already understand KV caching and memory management, jump to Section 5 (Request Scheduling) for the systems-level techniques. If you're evaluating serving frameworks, jump to Section 11 (Framework Comparison).
Serving an image classifier is simple. A request comes in, you run one forward pass, return the result, free memory. The GPU holds the model weights and a batch of inputs -- both fixed size. Memory is predictable.
LLM serving breaks every assumption that makes traditional serving easy:
| Property | Image Classifier | LLM |
|---|---|---|
| Forward passes per request | 1 | 100-10,000+ (one per output token) |
| Memory per request | Fixed (input size) | Growing (KV cache accumulates) |
| Request lifetime | Milliseconds | Seconds to minutes |
| State between steps | None | KV cache (potentially GBs) |
| Output length | Fixed | Unknown until end-of-sequence (EOS) token |
An image classifier processes a batch in one shot and frees memory. An LLM request generating 500 tokens holds GPU memory across 500 forward passes, and you cannot predict how many tokens it will generate before it starts. This makes LLM serving fundamentally a resource management problem, not just a compute problem.
Each token depends on all previous tokens. There is no way to parallelize generation within a single request -- token 5 cannot be generated until tokens 1-4 exist. Generating N tokens requires N serial forward passes.
This has a direct consequence: latency is proportional to output length. If each forward pass takes 30ms, generating 200 tokens takes 6 seconds. No amount of GPU compute changes this sequential dependency. You can make each step faster, but you cannot skip steps.
To put this in perspective: an image classifier processes a 224x224 image in a single 5ms forward pass. An LLM generating a 500-word response (~750 tokens) at 30ms per token takes 22.5 seconds -- 4,500x longer. And during those 22.5 seconds, the request's KV cache sits in GPU memory, blocking other requests from using that space.
This sequential nature also means the generation phase is embarrassingly serial within each request. If you have a GPU with 989 TFLOPS of compute (H100), most of that compute power sits idle during single-request decode because the work per step is so small. A single matrix-vector multiply for a 70B model is about 140 billion FLOPs -- which an H100 completes in less than 0.2ms. Loading 140 GB of weights at 3.35 TB/s takes about 42ms. In practice, with partial overlap and optimization, each step takes roughly 30ms -- the vast majority of which is spent waiting for memory.
(The one exception is speculative decoding, which we cover in Section 8. It generates multiple tokens per step by cleverly guessing ahead.)
Here is the most underappreciated fact about LLM serving: a single serving request involves two fundamentally different computations.
A GPU has two finite resources: compute units (measured in TFLOPS) and memory bandwidth (measured in TB/s). An operation is compute-bound when the compute units are the bottleneck -- the GPU finishes reading data faster than it can crunch the numbers. An operation is memory-bound when the memory bandwidth is the bottleneck -- the compute units sit idle waiting for data to arrive from HBM. Which regime you are in depends on the operation's arithmetic intensity: the ratio of compute operations to bytes transferred. For an H100, the crossover point is about 295 FLOPs/byte (989 TFLOPS / 3.35 TB/s). Operations above this ratio are compute-bound; below it, memory-bound.
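The crossover arithmetic above can be sketched in a few lines. This is a back-of-envelope model, not a profiler: it assumes the H100 specs quoted in the text and ignores overlap, caches, and kernel launch overhead.

```python
# Sketch: classify an operation as compute- or memory-bound on an H100.
# Assumed specs from the text: 989 TFLOPS and 3.35 TB/s HBM bandwidth.
PEAK_FLOPS = 989e12
PEAK_BW = 3.35e12  # bytes/s

def regime(flops, bytes_moved):
    """Compare an op's arithmetic intensity to the hardware crossover."""
    intensity = flops / bytes_moved       # FLOPs per byte
    crossover = PEAK_FLOPS / PEAK_BW      # ~295 FLOPs/byte
    return "compute-bound" if intensity > crossover else "memory-bound"

# Decode step for a 70B FP16 model: ~140 GFLOPs against 140 GB of weights.
print(regime(140e9, 140e9))             # memory-bound (~1 FLOP/byte)
# Prefill of a 2000-token prompt: 2000x the FLOPs, same weight bytes.
print(regime(2000 * 140e9, 140e9))      # compute-bound (~2000 FLOPs/byte)
```

The same weight bytes are loaded either way; only the amount of useful work per byte changes, which is the entire prefill/decode asymmetry.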
Prefill (processing the input prompt): The model processes all input tokens in parallel. This is a large matrix-matrix multiplication. The GPU's compute cores are busy. The bottleneck is FLOPS -- this is compute-bound.
Decode (generating output tokens): The model generates one token at a time. Each step is a matrix-vector multiplication (one new token times the weight matrices). The bottleneck is loading model weights from HBM for each tiny computation -- this is memory-bandwidth-bound.
| Characteristic | Prefill | Decode |
|---|---|---|
| Computation type | Matrix-matrix multiply | Matrix-vector multiply |
| Bottleneck | Compute (FLOPS) | Memory bandwidth (GB/s) |
| Arithmetic intensity | High (~100+ FLOPs/byte) | Low (~1-10 FLOPs/byte) |
| Tokens processed | Hundreds to thousands at once | One at a time |
| GPU utilization | 60-80% | 5-30% |
| Ideal batch size | Small (1-4) to saturate compute | Large (64-256+) to amortize weight loading |
In my experience, this dual nature is the aspect of LLM serving that teams new to deployment most often underestimate.
The Central Tension
Prefill and decode want opposite things from the hardware. This single observation shapes every architectural decision in LLM serving -- from batching strategies to parallelism to the emerging trend of running them on separate hardware entirely.
Prefill wants few requests with lots of tokens each; decode wants many requests to amortize the cost of loading weights. A single serving system must handle both simultaneously, and that tension drives most of the architectural decisions in this article.
Serving Metrics Glossary
These metrics appear throughout this article and in every LLM serving paper:

- TTFT (Time To First Token): delay from request arrival to the first generated token; dominated by prefill.
- TPOT (Time Per Output Token): average delay between successive output tokens; dominated by decode.
- Throughput: total tokens generated per second across all requests.
- Goodput: throughput that also meets latency SLOs (the DistServe paper's term).
GPU memory during LLM serving is consumed by four things: model weights, KV caches, activations, and framework overhead (CUDA context, workspace buffers).

For a Llama 3 70B model in FP16, the weights alone are 140 GB; activations and framework overhead take a few more GB per GPU; everything left over is available for KV caches.
The fundamental tension: on 8 H100s (640 GB total HBM), after storing the 140 GB model, you have ~500 GB for KV caches. At 4K context, that supports ~400 concurrent requests. At 128K context, that supports ~12.
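That budget math can be written down directly. A minimal sketch, assuming the Llama 3 70B geometry used throughout (80 layers, 8 KV heads, d_h=128, FP16) and the ~500 GB of free HBM from the text:

```python
# Sketch: how many concurrent requests fit in a given KV cache budget.
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, precision=2):
    """Bytes of KV cache per token: K and V, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * precision

def max_concurrent(free_hbm_gb, context_len):
    """Requests that fit if each one holds a full-context KV cache."""
    per_request_gb = kv_bytes_per_token() * context_len / 1024**3
    return int(free_hbm_gb // per_request_gb)

print(max_concurrent(500, 4096))     # ~400 requests at 4K context
print(max_concurrent(500, 131072))   # ~12 requests at 128K context
```

The 33x swing in concurrency comes entirely from context length; nothing about the model changed.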
Let me put this in dollar terms. An 8xH100 node rents for roughly $25,000/month. At 128K context, with only ~12 concurrent slots, that is about $2,000/month per concurrent user slot, just for the hardware. At 4K context, it is $62/month per slot. Long context is expensive not because the model is bigger, but because each user's KV cache is bigger.
This is why I say LLM serving is a memory management problem, not a compute problem. The KV cache is the scarce resource that every technique in this article ultimately manages.
Figure 1: The LLM serving memory equation. Adjust model size, context length, batch size, and precision to see how KV cache grows to dominate GPU memory. The stacked breakdown shows weights (fixed) vs. KV cache (variable) vs. other overhead.
My attention optimization article explains how KV caching works algorithmically -- why caching keys and values avoids redundant computation and reduces generation from O(N^2) to O(N) total work. Here, we examine the KV cache from a different angle: its systems implications. How much memory does it actually consume? Why does naive management waste 60-80% of that memory? And how does this shape every architectural decision in a serving system?
During autoregressive generation, the model attends to all previous tokens at each step. Without caching, this means recomputing the key (K) and value (V) projections for every previous token at every step -- O(N^2) total computation for N tokens.
The KV cache stores previously computed K and V tensors, so each step only computes K and V for the single new token and appends them to the cache. This reduces total computation to O(N). For details, see the attention optimization article.
The trade-off is memory. Every token in the cache consumes memory, and that memory is held for the request's entire lifetime. For a 70B model with 80 layers, each cached token requires: 2 (K and V) x 80 layers x 8 KV heads x 128 head dim x 2 bytes (FP16) = 327,680 bytes, or 320 KB per token.
A 4K-token conversation holds 1.25 GB of KV cache. A 128K-token document analysis holds 40 GB. And this is per request -- multiply by your concurrent request count to get the total memory pressure.
The KV cache stores two tensors (K and V) per layer, per token, for the KV heads:

KV cache bytes = 2 x n_layers x n_kv_heads x d_head x p x s x b

where n_layers = number of layers, n_kv_heads = number of KV heads, d_head = head dimension, p = bytes per element (2 for FP16), s = sequence length, b = batch size.
Let me compute this for real models:
def kv_cache_memory_gb(layers, kv_heads, head_dim, seq_len, batch_size=1, precision_bytes=2):
    """KV cache memory in GB for a single request or batch."""
    bytes_total = 2 * layers * kv_heads * head_dim * precision_bytes * seq_len * batch_size
    return bytes_total / (1024**3)

# Llama 3 8B: 32 layers, 8 KV heads (GQA), d_h=128
print(f"Llama 3 8B @ 4K:    {kv_cache_memory_gb(32, 8, 128, 4096):.2f} GB")
print(f"Llama 3 8B @ 128K:  {kv_cache_memory_gb(32, 8, 128, 131072):.2f} GB")

# Llama 3 70B: 80 layers, 8 KV heads (GQA), d_h=128
print(f"Llama 3 70B @ 4K:   {kv_cache_memory_gb(80, 8, 128, 4096):.2f} GB")
print(f"Llama 3 70B @ 128K: {kv_cache_memory_gb(80, 8, 128, 131072):.2f} GB")

# Llama 3 405B: 126 layers, 8 KV heads (GQA), d_h=128
print(f"Llama 3 405B @ 4K:   {kv_cache_memory_gb(126, 8, 128, 4096):.2f} GB")
print(f"Llama 3 405B @ 128K: {kv_cache_memory_gb(126, 8, 128, 131072):.2f} GB")
| Model | Layers | KV Heads | KV per request @ 4K | KV per request @ 128K |
|---|---|---|---|---|
| Llama 3 8B | 32 | 8 (GQA) | 0.50 GB | 16 GB |
| Llama 3 70B | 80 | 8 (GQA) | 1.25 GB | 40 GB |
| Llama 3 405B | 126 | 8 (GQA) | 1.97 GB | 63 GB |
Notice something: Llama 3 70B has 64 attention heads but only 8 KV heads. This is Grouped-Query Attention (GQA, Ainslie et al., 2023), where groups of query heads share key-value heads. Without GQA, the 70B model would use 64 KV heads and the cache would be 8x larger -- 10 GB per request at 4K context. GQA is not just a training trick; it is a serving requirement. Multi-Query Attention (MQA, Shazeer, 2019) takes this further, sharing a single KV head across all query heads.
On 8 H100s (640 GB total HBM), a Llama 3 70B model in FP16 uses 140 GB for weights, leaving ~500 GB for KV caches (ignoring activations and overhead for simplicity).
This is the fundamental trade-off: throughput vs. context length. You cannot have both. The relationship is a strict inverse: double the context length, halve the maximum batch size.
This explains why long-context applications (128K-1M token contexts) are so expensive to serve. A user asking questions about a long document consumes 30-100x more KV cache memory than a chatbot conversation. And unlike model weights, which are shared across all requests, each request's KV cache is unique and cannot be shared (except through prefix caching, Section 10).
Here is how this plays out in terms of throughput. A serving system with batch size B generates B tokens per step. If each step takes 30ms:

- At 4K context (batch size ~400): 400 tokens per 30ms step, about 13,300 tokens/sec.
- At 128K context (batch size ~12): 12 tokens per 30ms step, or 400 tokens/sec.
That is a 33x throughput reduction from 4K to 128K context. The model's compute cost per token is the same -- all the lost throughput comes from memory constraints.
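Under the same assumptions (a fixed 30ms step, batch size limited only by KV cache memory), the gap is one division:

```python
# Sketch: batch-limited decode throughput at a fixed per-step latency.
def decode_throughput(batch_size, step_ms=30):
    """Tokens/sec when every step emits one token per batched request."""
    return batch_size / (step_ms / 1000)

short_ctx = decode_throughput(400)   # 4K context:   ~13,300 tok/s
long_ctx = decode_throughput(12)     # 128K context:     400 tok/s
print(f"{short_ctx:.0f} vs {long_ctx:.0f} tok/s -> {short_ctx / long_ctx:.0f}x")
```

The per-step latency never changed; only how many requests shared each weight load.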
Every system in this article grapples with this constraint. Techniques like quantization (Section 7) shrink the cache to push the boundary; PagedAttention (Section 4) eliminates waste so you can fill the boundary more efficiently; disaggregation (Section 9) lets you use different hardware for the two phases; and prefix caching (Section 10) avoids recomputing shared prefixes entirely.
The simplest approach: when a request arrives, pre-allocate a contiguous chunk of GPU memory for its KV cache at the maximum possible sequence length. When the request finishes, free the chunk.
This wastes memory in three ways:
Reservation waste: You allocate for max_seq_len (say 2048 tokens) even if the request only generates 50 tokens. Memory for 1998 unused token positions sits idle.
Internal fragmentation: Even within the allocated region, the cache only fills as tokens are generated. A request that will eventually use 500 tokens starts with 499 empty slots.
External fragmentation: As requests arrive and depart, the free memory becomes a patchwork of non-contiguous holes. A new request needing 1 GB of contiguous memory might fail even if 2 GB is free -- because it is in 50 scattered 40 MB chunks.
The vLLM paper (Kwon et al., 2023) measured this: existing systems waste 60-80% of KV cache memory to fragmentation. That means on our 500 GB budget, only 100-200 GB is actually used for live tokens. This waste directly limits throughput.
The solution is PagedAttention (Section 4), which borrows virtual memory concepts from operating systems to eliminate nearly all of this waste.
During decode, each request generates one token per step. The computation is a single matrix-vector multiply: the model weights times one token embedding. On an H100 with 3.35 TB/s bandwidth and 989 TFLOPS, the balance point is ~295 FLOPs/byte. A matrix-vector multiply has arithmetic intensity of about 1-2 FLOPs/byte -- the GPU loads the full weight matrices and barely uses them.
GPU utilization during single-request decode is often below 5%. The hardware sits idle, waiting for data to move.
Batching multiple requests converts memory-bound matrix-vector multiplies into compute-bound matrix-matrix multiplies. If you batch 64 requests, the GPU loads the weight matrices once and multiplies them by 64 tokens simultaneously. Same data transfer, 64x more useful compute. Throughput scales nearly linearly with batch size -- until you run out of KV cache memory.
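A simplified model makes the "nearly linear" claim concrete. Assuming weight loading dominates the bytes moved (activation traffic ignored) and a d_model of 8192, intensity scales exactly with batch size:

```python
# Sketch: arithmetic intensity of one decode matmul vs. batch size.
# Assumes the weight load dominates memory traffic (activations ignored).
def decode_intensity(batch, d_model=8192, precision=2):
    flops = 2 * batch * d_model * d_model        # [B, d] @ [d, d] matmul
    bytes_moved = d_model * d_model * precision  # weights loaded once
    return flops / bytes_moved                   # FLOPs per byte

for b in (1, 64, 256):
    print(b, decode_intensity(b))   # intensity == batch size here
```

Against the H100's ~295 FLOPs/byte crossover, this toy model says decode needs a batch in the high hundreds to become compute-bound, which is why the earlier table lists an ideal decode batch of 64-256+.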
The simplest batching strategy: collect B requests, process them as a batch, wait until all finish. This is how traditional ML serving works -- it processes a batch of images, returns all results, and moves on.
For LLMs, the problem is variable-length generation. Unlike image classification where every input produces one output, LLM requests generate different numbers of tokens. Consider a batch of 8 requests with output lengths [10, 15, 20, 50, 100, 150, 200, 500]:
Let me quantify this. Total useful computation: 10+15+20+50+100+150+200+500 = 1,045 token-steps. Total computation performed: 8 x 500 = 4,000 token-steps (every slot runs for all 500 steps). Average GPU utilization: 1,045 / 4,000, about 26%. And this is just compute waste -- the KV cache memory for finished requests is also wasted.
From the user's perspective: the person who asked a simple yes/no question (10 tokens) waited for the person who asked for a 500-word essay. Both get their responses at the same time, despite the 50x difference in output length.
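The utilization arithmetic generalizes to any batch of output lengths:

```python
# Sketch: fraction of static-batch compute that does useful work.
def static_batch_utilization(output_lens):
    """Useful token-steps divided by total token-steps run."""
    useful = sum(output_lens)                     # each token generated once
    total = len(output_lens) * max(output_lens)   # every slot runs to the max
    return useful / total

lens = [10, 15, 20, 50, 100, 150, 200, 500]
print(f"{static_batch_utilization(lens):.0%}")    # 26%
```

Note the dependence on max(output_lens): one long straggler drags utilization down for the whole batch, which is exactly the failure mode continuous batching removes.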
The Orca paper (Yu et al., OSDI 2022) introduced the key insight: schedule at the iteration level, not the request level. When a request finishes (emits an EOS token), immediately replace it with a waiting request. No padding, no waiting.
Here is the iteration-level scheduling loop:
# Simplified continuous batching scheduler
from collections import deque

active_batch = []
waiting_queue = deque()

while True:
    # Step 1: Run one forward pass on the active batch
    outputs = model.forward(active_batch)

    # Step 2: Retire completed requests. Iterate over a copy:
    # removing from a list while iterating over it skips elements.
    for request in list(active_batch):
        if request.last_token == EOS:
            send_response(request)
            active_batch.remove(request)

    # Step 3: Fill empty slots with waiting requests
    while len(active_batch) < max_batch_size and waiting_queue:
        new_request = waiting_queue.popleft()
        run_prefill(new_request)  # Process input prompt
        active_batch.append(new_request)
The batch is always full (or as full as the queue allows). GPU compute is never wasted on finished requests.
An important subtlety: different requests in the batch may be at very different points in their generation. Request A might be 500 tokens into generation while Request B just started. They have different KV cache lengths, which means the attention computation differs per request. Orca handles this with selective batching: operations that depend on sequence length (attention layers) process each request individually, while operations that are independent of position (MLP layers, layer norms) batch across all requests.
Anyscale reported 23x throughput improvement over static batching in a 2023 blog post. The gain comes from two sources: (1) no wasted compute on finished requests, and (2) the batch stays full, maximizing GPU utilization during the memory-bound decode phase. The actual number depends on the variance in output lengths -- uniform lengths (like fixed-format tasks) show smaller gains, while highly variable lengths (like open-ended chat) show larger gains.
The 23x Number in Context
The 23x figure assumes a baseline of static batching with high output length variance. Real-world gains over a well-tuned static batching baseline are typically 5-15x. But compared to no batching at all, continuous batching can deliver 50-100x throughput improvement.
8 requests, batch size 4. Numbers = request ID. Darker = prefill, lighter = decode.
Figure 2: Static vs. continuous batching. In static batching (top), finished requests waste their slots until the entire batch completes. In continuous batching (bottom), finished requests are immediately replaced, keeping the GPU fully utilized.
Continuous batching has a subtle problem. When a new request enters the batch, its prompt must be prefilled -- all input tokens processed in a single forward pass. Prefill is compute-heavy: a 2000-token prompt generates 2000 KV pairs in one step.
During this prefill, all other requests in the batch must wait. Their decode steps (generating one token each) are blocked by the long prefill operation. This creates what the Sarathi-Serve paper (Agrawal et al., OSDI 2024) calls the "prefill bubble" -- a stall in decode throughput caused by a single request's prefill.
The concrete impact: if prefilling a 2000-token prompt takes 80ms and your decode TPOT target is 30ms, every request in the batch misses its TPOT deadline during that prefill step. With frequent new request arrivals, decode requests experience unpredictable latency spikes.
This motivates two solutions: chunked prefill (next) and prefill/decode disaggregation (Section 9).
Chunked prefill, introduced by SARATHI (Agrawal et al., 2023), splits long prefill prompts into fixed-size chunks (say, 512 tokens) and processes one chunk per iteration alongside the ongoing decode requests.
The key idea is decode-maximal batching: each iteration contains one prefill chunk plus as many decode requests as will fit. The prefill chunk saturates the GPU's compute capacity, and the decode requests "piggyback" at almost no additional cost because decode is memory-bound and the compute is free.
Without chunked prefill:
Step 1: [Prefill 2000 tokens] ← decode requests BLOCKED
Step 2: [Decode 64 requests]
Step 3: [Decode 64 requests]
With chunked prefill (chunk_size=512):
Step 1: [Prefill chunk 1 (512 tokens)] + [Decode 64 requests]
Step 2: [Prefill chunk 2 (512 tokens)] + [Decode 64 requests]
Step 3: [Prefill chunk 3 (512 tokens)] + [Decode 64 requests]
Step 4: [Prefill chunk 4 (512 tokens)] + [Decode 64 requests]
Decode requests are never blocked. The prefill takes 4 steps instead of 1, but the total system throughput is higher because decode never stalls. Sarathi-Serve extended this with stall-free scheduling, achieving up to 5.6x higher serving capacity on Falcon-180B compared to vLLM.
Chunked prefill is now standard. Both vLLM and SGLang have adopted it -- this is not an experimental technique.
Key Insight
Chunked prefill exploits the fact that prefill (compute-bound) and decode (memory-bound) use different GPU resources. Running them together in the same iteration uses both the compute units and the memory bandwidth simultaneously -- like running two different workloads that share the same hardware without competing.
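The decode-maximal batching rule can be sketched as a per-iteration planner. The `token_budget` and `chunk_size` knobs here are illustrative, not any framework's real API (vLLM exposes a similar idea through its scheduler configuration):

```python
# Sketch of decode-maximal batching: admit all decode requests first
# (one token each), then spend the leftover token budget on a single
# prefill chunk from the oldest waiting prompt.
def plan_iteration(decode_reqs, prefill_backlog, token_budget=576, chunk_size=512):
    batch = [("decode", r) for r in decode_reqs]   # 1 token per decode
    remaining = token_budget - len(batch)
    if prefill_backlog and remaining > 0:
        chunk = min(chunk_size, remaining, prefill_backlog[0])
        batch.append(("prefill_chunk", chunk))
        prefill_backlog[0] -= chunk                # tokens left to prefill
        if prefill_backlog[0] == 0:
            prefill_backlog.pop(0)                 # prompt fully prefilled
    return batch

backlog = [2000]              # one 2000-token prompt waiting
decodes = list(range(64))     # 64 in-flight decode requests
step = plan_iteration(decodes, backlog)
print(step[-1], backlog)      # ('prefill_chunk', 512) [1488]
```

Run repeatedly, this prefills the 2000-token prompt over four iterations while the 64 decode requests never miss a step.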
As we saw in Section 2.4, naive contiguous allocation wastes 60-80% of KV cache memory. Let me make this concrete.
Consider 8 requests on an 80 GB GPU. You pre-allocate 2048 token positions per request. Actual token counts at a given moment: [128, 256, 512, 1024, 156, 300, 890, 1800].
And this is the good case -- when requests fit neatly into memory. The real problem comes when a 9th request arrives and needs a 2048-position contiguous block, but the free memory is scattered across dozens of non-contiguous gaps left by completed requests.
Operating systems faced this exact problem decades ago. Early computers allocated contiguous physical memory to each process. When processes terminated, memory became fragmented. The solution -- invented in the 1960s -- was virtual memory with paging.
The idea: give each process a virtual address space that looks contiguous. Behind the scenes, the OS maps virtual pages to physical frames scattered anywhere in RAM. A page table tracks the mapping. The process thinks it has one big contiguous block; physical memory can be in a hundred pieces.
PagedAttention applies this same idea to KV caches.
PagedAttention (Kwon et al., SOSP 2023) divides KV cache memory into fixed-size blocks -- typically 16 tokens per block. Each request gets a block table that maps logical positions to physical GPU memory blocks.
Here is how it works step by step:
# Simplified PagedAttention block allocation
class PagedKVCacheManager:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # Pool of physical blocks
        self.block_tables = {}  # request_id -> [physical_block_ids]

    def allocate(self, request_id):
        """Allocate a new block for a request."""
        if not self.free_blocks:
            return None  # Out of memory -- trigger preemption
        block = self.free_blocks.pop()
        self.block_tables.setdefault(request_id, []).append(block)
        return block

    def free(self, request_id):
        """Free all blocks belonging to a finished request."""
        blocks = self.block_tables.pop(request_id, [])
        self.free_blocks.extend(blocks)  # Return blocks to pool
Let me walk through a concrete example. Say we have GPU memory divided into 1000 blocks of 16 tokens each (16,000 token positions total). Three requests arrive:
Request A (prompt: 100 tokens, generating): Needs ceil(100/16) = 7 blocks initially. The block table maps logical blocks [0, 1, 2, 3, 4, 5, 6] to physical blocks [42, 817, 3, 291, 550, 12, 666]. These physical blocks can be anywhere in GPU memory -- they do not need to be contiguous.
Request B (prompt: 50 tokens, generating): Gets 4 blocks: logical [0, 1, 2, 3] mapped to physical [100, 201, 99, 500].
Request A finishes: Its 7 blocks return to the free list. Physical blocks [42, 817, 3, 291, 550, 12, 666] are now available for any new request.
Request C (prompt: 200 tokens) arrives: Gets 13 blocks from the free list. Some might be from Request A's freed blocks, some from blocks that were always free. The request does not care -- it sees a contiguous logical address space.
The attention kernel in vLLM uses the block table to look up where each logical block lives in physical memory. This adds a level of indirection (one table lookup per block), but the overhead is minimal compared to the memory savings.
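The indirection itself is simple arithmetic. A sketch (the helper name `physical_slot` is mine, not vLLM's) using Request A's block table from the walkthrough:

```python
# Sketch: resolve a token's logical position to its physical slot
# through a PagedAttention-style block table (block_size=16).
def physical_slot(block_table, token_pos, block_size=16):
    logical_block = token_pos // block_size   # which entry in the table
    offset = token_pos % block_size           # position within the block
    phys_block = block_table[logical_block]   # the indirection step
    return phys_block * block_size + offset

table_a = [42, 817, 3, 291, 550, 12, 666]     # Request A from the example
print(physical_slot(table_a, 0))    # token 0  -> block 42, offset 0 -> 672
print(physical_slot(table_a, 99))   # token 99 -> logical block 6, offset 3
```

This is exactly a page-table walk: two integer divisions and one array lookup per access, traded for the freedom to place blocks anywhere.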
Near-zero waste: The only wasted memory is in the last block of each request, which may not be full. With 16 tokens per block, the average waste per request is 8 tokens -- less than 4% for any reasonable sequence length. For a request with 1000 tokens, that is 8/1000 = 0.8% waste.
External fragmentation eliminated: Blocks are scattered in physical memory by design. A new request can always use available blocks regardless of their physical locations. There is no concept of "not enough contiguous space."
24 memory blocks, 4 tokens/block. Each square = 1 block.
Figure 3: PagedAttention memory management. Each request's KV cache is stored in non-contiguous blocks mapped through a block table. When requests finish, their blocks return to the free list immediately. Compare this to contiguous allocation, where fragmentation accumulates over time.
Beam search and parallel sampling generate multiple output sequences from the same prompt. Without PagedAttention, you must duplicate the prompt's KV cache for each sequence -- a huge waste since they share the same prefix.
PagedAttention supports copy-on-write (another OS concept): all sequences sharing a prefix point to the same physical blocks. Only when a sequence diverges (writes a different token to a position) is the affected block copied. The vLLM paper reports up to 55% memory savings for beam search.
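Copy-on-write can be sketched with reference counts. This is a toy model of the mechanism, not vLLM's internals (which differ in detail); the class and method names are mine:

```python
# Sketch of copy-on-write block sharing via refcounts.
class CowBlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}                     # block -> number of owners

    def fork(self, blocks):
        """A new sequence shares the parent's prefix blocks (no copy)."""
        for b in blocks:
            self.refcount[b] = self.refcount.get(b, 1) + 1
        return list(blocks)

    def write(self, seq_blocks, idx):
        """Before a sequence writes into a shared block, copy it."""
        b = seq_blocks[idx]
        if self.refcount.get(b, 1) > 1:        # shared -> copy on write
            self.refcount[b] -= 1
            seq_blocks[idx] = self.free.pop()  # private copy for this seq
        return seq_blocks[idx]

pool = CowBlockPool(100)
parent = [5, 6, 7]
child = pool.fork(parent)      # both sequences point at blocks 5, 6, 7
print(pool.write(child, 2))    # diverging write: only block 7 is copied
```

Two blocks of the shared prefix stay shared forever; only the block where the sequences diverge is duplicated, which is where the beam-search savings come from.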
I think PagedAttention is one of the most elegant ideas in recent systems research -- taking a 50-year-old OS concept and applying it to a completely new domain. The insight is that KV caches have the same properties as process memory: variable-length, dynamic, and fragmentation-prone. Once you see the analogy, the solution follows naturally.
vLLM with PagedAttention achieves:
These numbers are from the original paper. In practice, the gains are largest when serving many concurrent requests with variable-length outputs -- exactly the common production scenario.
Traditional web servers are stateless: each request arrives, gets processed, and the server forgets about it. LLM requests are different -- each one holds a KV cache in GPU memory for its entire lifetime. Admitting a new request means committing memory that will be occupied for seconds or minutes.
The scheduler must balance four competing objectives: throughput (keep the batch full), latency (meet TTFT and TPOT targets), fairness (no request starves), and memory safety (never over-commit the KV cache).
First-Come-First-Served (FCFS): Process requests in arrival order. Simple, fair, but suffers from head-of-line blocking. A long request (say, "write me a 5000-word essay") delays all subsequent short requests ("translate this sentence").
Shortest-Job-First (SJF): Prioritize short requests to minimize average latency. The problem: we do not know the output length ahead of time. The user's prompt gives some hints (a translation request is probably shorter than an essay request), but the actual output length is only known when the model emits EOS.
In practice, systems use FCFS as the baseline with two escape valves: preemption (evict a running request when memory is full) and priority scheduling (bump certain requests ahead of others based on heuristics or business logic).
When a new high-priority request arrives and GPU memory is full, the scheduler has two options:
Option 1: Swapping. Move a lower-priority request's KV cache to CPU memory via PCIe (the standard CPU-to-GPU bus). When the request can resume, copy it back. The cost: two PCIe transfers -- for a 70B model at 4K context, 1.25 GB each way, tens of milliseconds at typical PCIe bandwidth -- plus CPU RAM to hold the evicted caches.
Option 2: Recomputation. Discard the lower-priority request's KV cache entirely. When it resumes, re-run its prefill from scratch. The cost: a full prefill over the prompt plus every token generated so far -- no transfers, but pure GPU compute that grows with context length.
The decision is straightforward: for short contexts (under ~4K tokens), recomputation is usually faster because the prefill is cheap and you avoid the round-trip transfer. For long contexts (over ~8K tokens), swapping wins because recomputation is expensive. Between 4K and 8K, it depends on your PCIe bandwidth and GPU compute speed.
vLLM supports both strategies and can be configured to choose automatically based on context length.
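The crossover claim can be sanity-checked with a back-of-envelope model. Every constant below is an illustrative assumption (effective PCIe throughput, prefill speed, fixed swap overhead), not a measurement; the point is the shape of the curves, not the exact numbers:

```python
# Sketch: swap vs. recompute cost for a preempted request.
KV_BYTES_PER_TOKEN = 327_680   # Llama 3 70B, FP16 (from Section 2)
PCIE_BW = 64e9                 # assumed effective PCIe bytes/s
PREFILL_TOKS_PER_S = 60_000    # assumed aggregate prefill speed
SWAP_OVERHEAD_S = 0.03         # assumed fixed cost (allocation, sync)

def swap_cost_s(context_len):
    """Evict to CPU and restore later: two transfers plus overhead."""
    return SWAP_OVERHEAD_S + 2 * context_len * KV_BYTES_PER_TOKEN / PCIE_BW

def recompute_cost_s(context_len):
    """Discard the cache and re-run prefill from scratch."""
    return context_len / PREFILL_TOKS_PER_S

for n in (2048, 4096, 16384):
    better = "recompute" if recompute_cost_s(n) < swap_cost_s(n) else "swap"
    print(n, better)   # short contexts favor recompute, long favor swap
```

With these assumptions the crossover lands near 5K tokens; on your hardware, the right move is to measure both costs and let the scheduler pick per request.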
5 requests with different arrival times and generation lengths. Dashed = waiting. Solid = processing.
Figure 4: Request scheduling under different policies. Notice how FCFS leads to head-of-line blocking (short requests wait for long ones), while preemptive scheduling can evict and swap requests to maintain latency targets.
In production, serving systems have SLOs: "99th percentile TTFT < 500ms" or "TPOT < 50ms". The scheduler's job is not just to maximize throughput -- it is to maximize throughput within SLO constraints (this is what the DistServe paper calls "goodput").
Mooncake (Qin et al., FAST 2025) uses prediction-based early rejection to maintain SLO compliance (detailed in Section 9.6).
Sarathi-Serve uses stall-free scheduling that prevents prefill from blocking decode, as we discussed in Section 3.5. By guaranteeing that decode requests always make progress, it can provide TPOT guarantees even under heavy prefill load.
A 70B model at FP16 = 140 GB. One H100 = 80 GB HBM. The model does not fit.
Even with INT4 quantization (35 GB), you still need room for KV caches, activations, and the CUDA runtime. Multi-GPU serving is almost always required for models above ~13B parameters if you want reasonable throughput.
Tensor parallelism splits each layer's weight matrices across GPUs. Each GPU holds a "slice" of every layer and processes its slice of every token.
The technique was formalized by Megatron-LM (Shoeybi et al., 2020). Here is how it works for the attention and MLP layers:
MLP parallelism: For a two-layer MLP (output = GeLU(X·A)·B), matrix A is split column-wise across GPUs. Each GPU computes its portion of the hidden dimension independently. Then matrix B is split row-wise, and each GPU computes its portion of the output. A single all-reduce (a distributed operation where every GPU sums its partial results so all end up with the complete output) after the second matmul synchronizes the results across GPUs.
TP=4 MLP example:
GPU 0: X @ A[:, 0:H/4] → GeLU → @ B[0:H/4, :] → partial_output_0
GPU 1: X @ A[:, H/4:H/2] → GeLU → @ B[H/4:H/2, :] → partial_output_1
GPU 2: X @ A[:, H/2:3H/4] → GeLU → @ B[H/2:3H/4, :] → partial_output_2
GPU 3: X @ A[:, 3H/4:H] → GeLU → @ B[3H/4:H, :] → partial_output_3
↓
all-reduce (sum)
↓
final output
Attention parallelism: The attention heads are split across GPUs. With 64 heads and TP=8, each GPU handles 8 heads. The QKV projections and output projection are partitioned accordingly. Each GPU computes attention for its heads independently, then an all-reduce combines the results.
The communication cost is 2 all-reduces per layer (one for attention, one for MLP). For a 70B model with 80 layers, that is 160 all-reduces per forward pass. On NVLink (NVIDIA's high-speed GPU-to-GPU interconnect, 900 GB/s bidirectional on H100), each all-reduce for a small activation tensor takes about 10-50 microseconds. The total communication overhead is a few milliseconds -- acceptable within a node but prohibitive across nodes where bandwidth drops to 50-100 GB/s.
Pipeline parallelism splits layers across GPUs: GPU 0 gets layers 0-19, GPU 1 gets layers 20-39, and so on. Each GPU processes its layers and passes the result (the hidden state -- a relatively small tensor) to the next GPU.
This was pioneered by GPipe (Huang et al., NeurIPS 2019), which introduced micro-batch pipelining: instead of processing one full batch sequentially through the pipeline (which leaves most GPUs idle), GPipe splits the batch into micro-batches that flow through the pipeline concurrently.
Naive PP (one micro-batch):
GPU 0: [████████] [idle] [idle] [idle]
GPU 1: [idle] [████████] [idle] [idle]
GPU 2: [idle] [idle] [████████] [idle]
GPU 3: [idle] [idle] [idle] [████████]
↑ 75% idle!
Micro-batch PP (4 micro-batches):
GPU 0: [mb1][mb2][mb3][mb4]
GPU 1: [mb1][mb2][mb3][mb4]
GPU 2: [mb1][mb2][mb3][mb4]
GPU 3: [mb1][mb2][mb3][mb4]
↑ much less idle time
The communication is minimal: only the activation tensor (batch_size x hidden_dim, typically a few MB) needs to pass between adjacent stages. This makes PP work well over slower inter-node interconnects like InfiniBand (a high-bandwidth data center networking fabric, 100-400 Gb/s).
Megatron-LM v2 (Narayanan et al., SC 2021) showed how to combine TP within a node and PP across nodes, achieving efficient scaling to thousands of GPUs for both training and inference.
When context lengths exceed what fits in a single GPU's KV cache memory, you can split the sequence across GPUs. Each GPU stores a portion of the KV cache and processes attention for its segment.
The key mechanism is ring attention: GPUs are arranged in a logical ring. Each GPU computes attention for its local KV cache segment, then passes its K/V blocks to the next GPU in the ring. After a full rotation, every GPU has attended to the complete sequence. The communication pattern is elegant -- only nearest-neighbor communication is needed, and it can overlap with computation.
Note: the term "sequence parallelism" is sometimes used in the Megatron-LM literature to refer specifically to partitioning LayerNorm and Dropout along the sequence dimension. Here we use it more broadly to describe any technique that splits the sequence across GPUs, including context parallelism and ring attention (Liu et al., 2023).
Sequence parallelism is primarily useful for very long contexts (100K+ tokens) where the KV cache for a single request does not fit in one GPU's memory, even after the model weights are distributed via tensor parallelism. For example, Llama 3 70B at 128K context requires 40 GB of KV cache per request in FP16 -- with TP=4, each GPU holds 10 GB of KV cache for just one request, leaving little room for batching.
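The 40 GB figure is easy to check with a back-of-envelope calculation, using Llama 3 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128, 2 bytes per value in FP16):

```python
# Back-of-envelope KV cache size for one request.
# Llama 3 70B shape: 80 layers, 8 KV heads (GQA), head dim 128, FP16.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 = one K and one V tensor per layer
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

per_request = kv_cache_bytes(80, 8, 128, 128 * 1024)
print(per_request / 2**30)      # 40.0 GiB at 128K context
# Under TP=4 the 8 KV heads are sharded 2 per GPU:
print(per_request / 4 / 2**30)  # 10.0 GiB per GPU
```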
Mixture-of-Experts (MoE) models like DeepSeek V3 (671B total parameters, 256 experts, 37B active per token) need special treatment. The model has 256 expert MLPs, but each token only activates 8 of them. Storing all 256 experts on every GPU wastes memory -- most experts sit idle for any given token.
Expert parallelism assigns each GPU a subset of experts. When a token is routed to an expert on another GPU, the token embedding is sent to that GPU via all-to-all communication (each GPU sends data to every other GPU). The expert processes the token and sends the result back.
The communication pattern is expensive -- all-to-all requires every GPU to send data to every other GPU, which can overwhelm network bandwidth. This is why EP is typically used across nodes (where you have fast InfiniBand) while TP handles the dense layers within a node.
DeepSeek V3 uses a combination: TP=4 within each 8-GPU node for the attention layers, and EP across nodes for the expert layers. The routing decisions (which tokens go to which experts) are made by a learned gating network that runs on all GPUs.
| Scenario | Recommendation | Why |
|---|---|---|
| Model fits in 1 GPU | No parallelism needed | Simplest |
| Model fits in 1 node (2-8 GPUs) | TP within node | Low latency, NVLink available |
| Model spans multiple nodes | TP within node + PP across nodes | PP tolerates slower inter-node links |
| 100K+ context on 70B model | TP + SP | KV cache too large for one GPU |
| MoE model (DeepSeek V3, Mixtral) | TP + EP | Experts need dedicated GPUs |
| Disaggregated serving | Different strategies per phase | See Section 9 |
In my opinion, the DistServe insight -- that prefill and decode benefit from different parallelism strategies -- is one of those ideas that seems obvious in hindsight but fundamentally changes how you design serving systems. Prefill wants high TP (reduce latency of one big computation). Decode wants lower TP or PP (maximize throughput of many small computations).
Figure 5: Parallelism strategies compared. Tensor parallelism splits each layer across GPUs (requiring all-reduce after every layer). Pipeline parallelism assigns different layers to different GPUs (requiring communication only between stages). The optimal strategy depends on your model size, hardware topology, and serving requirements.
Parallelism spreads the model across GPUs. Quantization shrinks the model so you need fewer GPUs -- or can fit more requests on the same ones.
Quantization reduces numerical precision (e.g., FP16 to INT4 or FP8). For serving, the benefits hit at every level: smaller weights need fewer GPUs, a smaller KV cache fits more concurrent requests, and lower-precision arithmetic speeds up the memory-bound decode phase.
Keep activations in FP16, quantize only the weights to INT4 or INT8. During matrix multiplication, weights are dequantized on the fly.
Why it works during decode: Decode is memory-bound -- the bottleneck is loading weights from HBM. Smaller weights transfer faster. The dequantization cost is minimal compared to the memory bandwidth savings.
Why it's less effective during prefill: Prefill is compute-bound -- you're already saturating the compute units. Smaller weights don't help because memory bandwidth isn't the bottleneck, and dequantization adds overhead.
GPTQ (Frantar et al., ICLR 2023): Uses Hessian-based optimization to quantize weights layer by layer. The first method to push 4-bit quantization on 175B parameter models with negligible quality loss. Achieves 3.25x speedup on A100.
AWQ (Lin et al., MLSys 2024 Best Paper): Observes that not all weights are equally important -- protecting 1% of salient channels (identified by activation magnitudes) greatly reduces quantization error. More robust than GPTQ across models and tasks. Comparable or faster inference speed than GPTQ with better quality retention.
GGUF/llama.cpp: Per-tensor quantization in a variety of formats (Q4_K_M, Q5_K_S, etc.) optimized for CPU inference. The standard for running models locally on laptops and desktops.
Quantizing both weights and activations enables INT8 or FP8 Tensor Core operations -- actual compute speedup, not just memory savings.
SmoothQuant (Xiao et al., ICML 2023): The challenge is that activations have outliers ~100x larger than typical values. If the largest activation value is 100 and the smallest is 0.01, INT8 quantization must map the range [0.01, 100] into 256 levels -- wasting most of the precision on the outlier range while values near 0.01 get crushed to zero.
SmoothQuant's solution: divide each activation channel j by a smoothing factor s_j, and multiply the corresponding weight channel by s_j. This is mathematically equivalent (the factors cancel in the matrix multiply: Y = XW = (X diag(s)^-1)(diag(s) W)), but the activation range shrinks while the weight range expands:
The smoothed activations have smaller outliers (easier to quantize). The scaled weights have larger range (still easy to quantize since weights are already well-behaved). Result: both fit nicely into INT8. Achieves 1.56x speedup and 2x memory reduction.
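The cancellation is easy to verify numerically. Here is a minimal sketch with made-up values (the smoothing factors and matrices below are illustrative, not from the paper):

```python
# SmoothQuant's equivalence: scaling activation channels down and the
# matching weight channels up leaves the matmul output unchanged,
# while shrinking the activation outlier range.
def matmul(X, W):
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

X = [[100.0, 0.01],   # channel 0 has a ~100x outlier
     [80.0, 0.02]]
W = [[0.5, -0.3],
     [1.2, 0.7]]

s = [10.0, 0.1]  # hypothetical per-channel smoothing factors
X_smooth = [[x / s[j] for j, x in enumerate(row)] for row in X]
W_smooth = [[w * s[i] for w in row] for i, row in enumerate(W)]

Y, Y_smooth = matmul(X, W), matmul(X_smooth, W_smooth)
assert all(abs(a - b) < 1e-9 for ra, rb in zip(Y, Y_smooth)
           for a, b in zip(ra, rb))
# Activation range shrank from [0.01, 100] to [0.1, 10]:
print(max(max(row) for row in X_smooth))  # 10.0
```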
FP8 (W8A8): The H100's native FP8 support -- E4M3 format (4 exponent bits, 3 mantissa bits -- better precision for inference) and E5M2 format (5 exponent bits, 2 mantissa bits -- wider range for gradients) -- has become the default for production serving on Hopper GPUs. FP8 provides 2x the Tensor Core throughput of FP16, translating to 30-50% serving throughput improvement with 99-100% quality retention on most tasks. This is the sweet spot in 2026: meaningful speedup with negligible quality loss.
You can quantize the KV cache independently of the model weights. This directly addresses the memory bottleneck: an FP8 KV cache halves per-token cache memory, roughly doubling the number of concurrent requests that fit on the same hardware.
In vLLM, enable it with --kv-cache-dtype fp8.

| Method | Precision | Quality Retention | Throughput Gain | Hardware |
|---|---|---|---|---|
| FP16 baseline | W16A16 | 100% | 1x | Any GPU |
| FP8 | W8A8 | 99-100% | 1.5-2x | H100+ |
| SmoothQuant | W8A8 (INT) | 98-99% | 1.5x | A100+ |
| AWQ | W4A16 | 97-99% | 2-3x | Any GPU |
| GPTQ | W4A16 | 96-99% | 2-3x | Any GPU |
| GGUF Q4_K_M | W4A16 | 95-98% | 2-3x (CPU) | CPU |
My recommendation: FP8 on H100 is the default choice for production. If you need maximum memory efficiency, combine INT4 weight quantization (AWQ) with FP8 KV cache. If you're running on consumer hardware, GGUF formats with llama.cpp.
Here is what this looks like in practice with vLLM:
# FP8 quantization on H100 -- the sweet spot for production
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--enable-prefix-caching
This single command gives you FP8 model weights, FP8 KV cache (2x more requests), tensor parallelism across 4 GPUs, and automatic prefix caching. Before frameworks like vLLM and SGLang existed, achieving this combination -- FP8 quantization, prefix caching, tensor parallelism -- required building custom CUDA kernels. Today, it is a single CLI command.
| Method | Precision | Quality | Throughput | Memory | Hardware |
|---|---|---|---|---|---|
| FP16 (baseline) | W16A16 | 100% | 1.0x | 100% | Any GPU |
| FP8 (H100) | W8A8 | 99.7% | 1.8x | 50% | H100/H200 (FP8) |
| SmoothQuant | W8A8 | 99.5% | 1.5x | 50% | INT8 Tensor Cores |
| GGUF Q8_0 | W8A16 | 99% | 0.6x | 50% | CPU/GPU |
| AWQ | W4A16 | 98.5% | 1.4x | 27% | Any GPU |
| GPTQ | W4A16 | 97.5% | 1.3x | 27% | Any GPU |
| INT4 + FP8 KV | W4A16 + FP8 KV | 97% | 1.5x | 25% | H100 (FP8 KV) |
| GGUF Q4_K_M | W4A16 | 96% | 0.8x | 27% | CPU/GPU |
Figure 6: Quantization methods compared on the accuracy-throughput frontier. FP8 on H100 achieves the best ratio of throughput gain to quality loss. Weight-only INT4 methods (AWQ, GPTQ) are best when memory is the primary constraint.
Quantization is Not Free
Quality degradation is task-dependent. Quantized models may perform well on benchmarks but fail on edge cases in your specific domain. Always evaluate on your actual workload before deploying quantized models in production.
Here is a surprising fact about LLM decode: generating 1 token costs almost the same as verifying 5 tokens in parallel.
Why? The bottleneck during decode is loading model weights from HBM. Whether you multiply those weights by 1 token or 5 tokens, the data transfer cost is nearly identical -- you are reading the same weight matrices either way. The additional compute for 5 tokens is negligible because the GPU is memory-bound.
This means: if we can guess the next K tokens cheaply, we can verify all K in one forward pass of the target model, potentially accepting most or all of them. The expected number of accepted tokens depends on both the draft length K and the acceptance rate alpha, following a geometric series (we will derive this shortly).
I suspect speculative decoding is still underutilized in production because teams struggle with the operational complexity of maintaining two models. The self-speculative methods I'll describe below are changing this.
Speculative decoding (Leviathan et al., ICML 2023; independently, Chen et al., 2023) works in three phases:
def speculative_decode(target_model, draft_model, prompt, K=5):
    """Generate tokens with speculative decoding."""
    tokens = list(prompt)
    while tokens[-1] != EOS:
        # Phase 1: Draft K tokens cheaply
        draft_tokens = []
        draft_probs = []
        for _ in range(K):
            p_draft = draft_model.get_probs(tokens + draft_tokens)
            token = sample(p_draft)
            draft_tokens.append(token)
            draft_probs.append(p_draft)

        # Phase 2: Verify all K tokens in ONE forward pass.
        # Returns K+1 distributions: one per draft position, plus one
        # for the position after the last draft token (the bonus slot).
        target_probs = target_model.get_probs_batch(tokens, draft_tokens)

        # Phase 3: Accept or reject each token
        accepted = []
        for i, token in enumerate(draft_tokens):
            p_target = target_probs[i][token]
            p_draft_token = draft_probs[i][token]
            # Accept with probability min(1, p_target / p_draft_token)
            if random() < min(1.0, p_target / p_draft_token):
                accepted.append(token)
            else:
                # Reject: sample from the residual distribution
                # (elementwise over the vocabulary)
                residual = max(0, target_probs[i] - draft_probs[i])
                residual = residual / residual.sum()
                accepted.append(sample(residual))
                break  # Stop accepting after the first rejection
        else:
            # All K tokens accepted -- sample bonus token from target
            bonus = sample(target_probs[K])
            accepted.append(bonus)

        tokens.extend(accepted)
    return tokens
This is the key insight that makes speculative decoding special: the output distribution is identical to standard autoregressive sampling from the target model. It is not an approximation. The name "speculative decoding" is deliberate -- it borrows from speculative execution in CPU design, where the processor guesses ahead and rolls back if the guess is wrong.
Key Insight
This guarantee is what makes speculative decoding so powerful: you get 2-3x faster generation with zero quality degradation. The output is not an approximation -- it is mathematically identical to what the target model would produce on its own.
Here is the intuition first, then the math.
Intuition: For each draft token, we ask: "would the target model have produced this token?" If the target assigns equal or higher probability than the draft model, we always accept -- the draft model guessed something the target model likes. If the target assigns lower probability, we accept with reduced probability proportional to the target's preference. If we reject, we sample from the "leftover" distribution -- the tokens the target model likes but the draft model did not favor enough.
The math: For each position, let q(x) be the draft model's probability and p(x) be the target model's probability. Accept the drafted token x with probability min(1, p(x) / q(x)); on rejection, sample from the normalized residual max(0, p(x) - q(x)) / sum_y max(0, p(y) - q(y)).
Why this produces samples from exactly p(x): Consider two cases for a specific token x:
Case 1: p(x) >= q(x) (target likes this token at least as much as the draft). The acceptance probability is 1 -- always accept. The contribution to the output distribution from this path is q(x). The remaining p(x) - q(x) comes from the rejection path: the probability of rejecting the drafted token is sum_y max(0, q(y) - p(y)), and the residual distribution assigns x probability (p(x) - q(x)) / sum_y max(0, p(y) - q(y)) (the two sums are equal by construction). So the total is q(x) + (p(x) - q(x)) = p(x).
Case 2: p(x) < q(x) (draft overestimates this token). Accept with probability p(x) / q(x), so the accepted contribution is q(x) * p(x)/q(x) = p(x). The residual distribution assigns x zero probability (max(0, p(x) - q(x)) = 0), so the rejection path contributes nothing. The total is again exactly p(x).
The better the draft model matches the target, the higher the acceptance rate. A perfect draft model (q = p) would accept every token, achieving a (K+1)x speedup. In practice, acceptance rates of 70-85% are typical with a well-matched draft model (e.g., Llama 3 8B drafting for Llama 3 70B).
Expected tokens per step: If the acceptance rate is alpha and we draft K tokens, the expected number of tokens produced per step is:

E[tokens] = (1 - alpha^(K+1)) / (1 - alpha)

For alpha = 0.8 and K = 5: E = (1 - 0.8^6) / 0.2 ≈ 3.7 tokens per step. Since the verification step costs roughly the same as one standard decode step (both load the full model weights), this is a ~3.7x speedup.
This follows from the geometric series: the probability of producing at least i+1 tokens in a step is alpha^i (the first i drafts must all be accepted), so E[tokens] = sum_{i=0}^{K} alpha^i = (1 - alpha^(K+1)) / (1 - alpha).
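Both the closed form and the acceptance process are easy to check numerically. A quick sketch, modeling each acceptance as an independent coin flip with probability alpha (a simplification -- real acceptance events are correlated with content):

```python
import random

# Expected tokens per speculative step: the geometric series
# sum_{i=0}^{K} alpha^i = (1 - alpha^(K+1)) / (1 - alpha).
def expected_tokens(alpha: float, K: int) -> float:
    return (1 - alpha ** (K + 1)) / (1 - alpha)

print(round(expected_tokens(0.8, 5), 2))  # 3.69 tokens per step

# Sanity check by simulation: accept each of K drafts with prob alpha,
# stop at the first rejection; every step also yields one extra token
# (the resampled token on rejection, or the bonus token on full accept).
def simulate(alpha, K, trials=200_000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n = 0
        while n < K and rng.random() < alpha:
            n += 1
        total += n + 1
    return total / trials

print(round(simulate(0.8, 5), 2))  # ~3.69
```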
Note
Speculative decoding's benefits diminish at high batch sizes. When the GPU is already well-utilized processing a large batch during decode, there is less "free compute" available for the verification pass. In practice, speculative decoding is most effective for low-to-medium concurrency scenarios.
The small draft model generates K=4 candidate tokens autoregressively.
| Position | Token 1 | Token 2 | Token 3 | Token 4 |
|---|---|---|---|---|
| Draft token | The | quick | brown | fox |
| p_draft | 0.85 | 0.72 | 0.65 | 0.90 |
Figure 7: Speculative decoding step-by-step. The draft model generates K candidate tokens quickly, the target model verifies them all in one forward pass, and rejection sampling ensures the output matches the target distribution exactly.
The classic approach: use a smaller model from the same family. For Llama 3 70B as the target, use Llama 3 8B as the draft. The draft model should be 5-10x smaller than the target -- small enough to generate quickly, large enough to match the target's distribution reasonably well.
The acceptance rate depends heavily on how well the draft model matches the target. Some rules of thumb from practice:
The challenge: maintaining two models increases memory requirements. Llama 3 8B at FP16 is 16 GB -- on top of the 70B model's 140 GB (or more, after KV caches). On 8 H100s (640 GB), the draft model adds 2.5% memory overhead, which is manageable. But on smaller setups, the draft model's memory can be significant.
Operational complexity is the bigger concern: both models must use the same tokenizer, both need to be updated when you fine-tune or upgrade, and the serving system must orchestrate the draft-verify-accept loop.
Medusa (Cai et al., ICML 2024): Add K extra MLP "heads" to the target model. Each head predicts a different future token from the same hidden state. Tree-structured verification allows checking multiple candidate continuations simultaneously. Achieves 2.2-3.6x speedup.
EAGLE (Li et al., ICML 2024): Autoregressively predicts at the feature level (second-to-last layer features) rather than the token level, where autoregression is more regular and predictable. EAGLE-2 (EMNLP 2024) added dynamic draft trees. The EAGLE family achieves 3.0-6.5x speedup (EAGLE-2: 3.0-4.3x, EAGLE-3: 4-6x) -- currently one of the fastest speculative decoding methods available.
For a comprehensive taxonomy of speculative decoding approaches, see the survey by Xia et al., ACL 2024.
| Method | Requires Draft Model? | Speedup | Training Needed | Memory Overhead |
|---|---|---|---|---|
| Classic Speculative Decoding | Yes (separate model) | 2-3x | No | High (2 models) |
| Medusa | No (extra heads) | 2.2-3.6x | Yes (heads) | Low (~1B params) |
| EAGLE | No (feature predictor) | 2-3x | Yes (predictor) | Low (~1B params) |
| EAGLE-2/3 | No (dynamic trees) | 3.0-6.5x | Yes (predictor) | Low (~1B params) |
Sections 3-8 presented techniques that optimize serving on a fixed set of GPUs -- better batching, better memory management, better scheduling. But they all share one assumption: prefill and decode run on the same GPUs. This creates a fundamental conflict:
You cannot optimize for both TTFT and TPOT simultaneously on shared hardware. This is the motivation for disaggregation: run prefill and decode on separate GPU pools.
Assign prefill to one set of GPUs and decode to another. After prefill completes, transfer the KV cache to a decode GPU.
Each phase gets its own optimal configuration: batch size, degree and type of parallelism, and even GPU generation can be chosen independently for the prefill pool and the decode pool.
The cost: you must transfer the KV cache from prefill to decode GPUs. For Llama 3 70B at 4K context, that is ~1.25 GB per request. Over NVLink (900 GB/s), the transfer takes ~1.4ms. Over InfiniBand (400 Gb/s = 50 GB/s), about 25ms.
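These transfer numbers are straightforward to reproduce from the per-token KV size computed earlier for Llama 3 70B (~0.32 MB/token in FP16); the small differences from the figures above come from GB-vs-GiB rounding:

```python
# KV cache transfer cost for disaggregated serving.
# Per-token KV for Llama 3 70B: 2 (K,V) * 80 layers * 8 KV heads
# * 128 head dim * 2 bytes (FP16) = 327,680 bytes.
kv_bytes = 4096 * (2 * 80 * 8 * 128 * 2)  # one 4K-context request

def transfer_ms(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

print(round(kv_bytes / 2**30, 2))            # 1.25 GiB per request
print(round(transfer_ms(kv_bytes, 900), 1))  # ~1.5 ms over NVLink
print(round(transfer_ms(kv_bytes, 50), 1))   # ~27 ms over 400 Gb/s IB
```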
Three major systems have taken different approaches to disaggregation:
| Feature | DistServe | Splitwise | Mooncake |
|---|---|---|---|
| Paper | Zhong et al., OSDI 2024 | Patel et al., ISCA 2024 | Qin et al., FAST 2025 |
| Core idea | Goodput-optimized disaggregation | Heterogeneous hardware per phase | KVCache-centric distributed architecture |
| Hardware | Homogeneous (same GPUs) | Heterogeneous (H100 prefill, A100 decode) | Homogeneous + CPU/SSD for KV storage |
| KV cache location | Transferred between GPU pools | Transferred between GPU pools | Distributed pool (GPU + CPU DRAM + SSD) |
| Scheduler | Phase-specific parallelism optimizer | Cost/throughput optimizer | Global "Conductor" with KV-aware routing |
| Key insight | Prefill and decode want different parallelism | Decode doesn't need expensive GPUs | KV cache is the central resource, not computation |
| Production use | Research system | Research system | Powers Kimi (Moonshot AI), thousands of nodes |
| Main result | 7.4x more requests within SLO | 1.4x throughput at 20% lower cost | 75% more requests under real workloads |
DistServe (Zhong et al., OSDI 2024) formalizes goodput -- throughput within latency SLOs -- as the optimization target. The key insight is that prefill and decode not only have different compute profiles but also benefit from different parallelism strategies:
DistServe also uses bandwidth-aware placement to minimize KV cache transfer costs. Prefill and decode instances that communicate frequently are placed on GPUs with fast interconnects.
Result: 7.4x more requests meeting SLOs, or equivalently, 12.6x tighter SLO targets compared to colocated systems. A 7.4x improvement in SLO compliance means you can serve the same workload on roughly 7x fewer GPUs -- or 7x more users on the same hardware.
Splitwise (Patel et al., ISCA 2024) asks a practical question: if decode does not need much compute, why pay for expensive compute GPUs?
Decode is memory-bandwidth-bound. An A100 has 2 TB/s bandwidth and 312 TFLOPS. An H100 has 3.35 TB/s bandwidth and 989 TFLOPS. For decode, the H100's 3x compute advantage is irrelevant -- bandwidth is only 1.7x better. But the H100 costs roughly 2x more.
Splitwise exploits this: assign H100s to prefill (where you need the 989 TFLOPS) and A100s to decode (where 2 TB/s bandwidth is sufficient). The paper explores several such mixed configurations, trading hardware cost against throughput.
This is a practical insight for anyone building a serving cluster. Not every GPU in your fleet needs to be the latest generation.
Mooncake (Qin et al., FAST 2025 Best Paper), the production system behind Moonshot AI's Kimi chatbot, takes disaggregation furthest. Instead of treating KV cache as a side effect of computation, it treats KV cache as the first-class resource around which the entire system is designed.
The architecture has three major components: a prefill pool, a decode pool, and a distributed KV cache pool that spans GPU memory, CPU DRAM, and SSD.
A global scheduler called Conductor routes requests based on where their KV caches already reside. If a request's prefix is cached on node 5, Conductor routes it to node 5 to avoid a costly KV cache transfer. This ties directly into prefix caching (Section 10) -- Mooncake combines disaggregation and prefix caching into a unified system.
Mooncake handles overloaded scenarios with prediction-based early rejection: before admitting a request, the system estimates whether it can meet its SLO. If adding the request would degrade existing requests below their SLOs, the new request is rejected immediately. The rationale: it is better to reject one request with a clean error than to return slow responses for everyone.
Under real production workloads serving Kimi on A800 clusters, Mooncake handles 75% more requests. In simulated scenarios with varying context lengths, throughput improvements range from 50% to 525% compared to vLLM baselines.
In colocated serving, prefill and decode compete for the same GPU resources: a long prefill stalls every ongoing decode (spiking TPOT for the whole batch), and batch size and parallelism cannot be tuned independently per phase.
Figure 8: Colocated vs. disaggregated serving. In the colocated model (left), prefill and decode compete for the same GPU resources. In the disaggregated model (right), each phase gets its own GPU pool with tailored configurations. The KV cache transfer between phases is the key overhead.
Disaggregation is not free: every request pays a KV cache transfer between pools, the pools need a fast interconnect, and you now operate two fleets whose capacity must be balanced against each other.
My sense is that disaggregation will be the default architecture for any large-scale deployment within the next year or two. The engineering overhead is real, but the efficiency gains are too large to ignore at scale.
Many LLM workloads have shared prefixes: system prompts repeated across every request, few-shot examples, earlier turns of a conversation, and shared documents in RAG pipelines.
If 100 requests share a 2000-token system prompt, a naive system computes the KV cache for those 2000 tokens 100 times. With prefix caching: compute it once, reuse it 99 times. That is 99% less prefill compute for the shared prefix.
vLLM implements automatic prefix caching at the block level. Here is how it works:
Each 16-token block is keyed by a hash of its tokens (hash(token_ids[i:i+16])); on a hit, the cached KV block is reused instead of recomputed. This works transparently -- no code changes needed. Enable it with --enable-prefix-caching and vLLM handles the rest. The hash computation is negligible (microseconds) compared to the prefill compute it saves (milliseconds to seconds).
Concrete savings: If your system prompt is 1000 tokens and you serve 100 requests per minute, without prefix caching you compute the system prompt 100 times. With prefix caching, you compute it once and reuse it 99 times. For a 70B model, that saves roughly 99 x 20ms = ~2 seconds of GPU time per minute. At scale (thousands of requests per minute), the savings are substantial.
Limitations: matching is at block boundaries (16-token granularity), and only exact prefix matches count. A one-token difference in the prefix results in a full miss for the differing block and all subsequent blocks. This means prefix caching works best when prefixes are identical byte-for-byte -- which they are for system prompts and few-shot examples, but not for paraphrased content.
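A minimal sketch of the block-level scheme (greatly simplified: real vLLM chains block hashes and manages eviction; here each block is keyed by the full prefix up to its boundary, which enforces the exact-prefix-match behavior described above):

```python
BLOCK = 16  # tokens per block

class PrefixCache:
    """Toy block-level prefix cache (hypothetical helper, not vLLM's API)."""

    def __init__(self):
        self.blocks = {}  # prefix-hash -> cached KV block (placeholder)

    def _keys(self, token_ids):
        full = len(token_ids) - len(token_ids) % BLOCK
        for i in range(0, full, BLOCK):
            # Key covers the whole prefix, so a one-token difference
            # earlier in the prompt misses this block and all later ones.
            yield hash(tuple(token_ids[: i + BLOCK]))

    def match(self, token_ids):
        """Number of leading tokens whose KV blocks are already cached."""
        matched = 0
        for key in self._keys(token_ids):
            if key not in self.blocks:
                break
            matched += BLOCK
        return matched

    def insert(self, token_ids):
        for key in self._keys(token_ids):
            self.blocks.setdefault(key, object())

cache = PrefixCache()
system_prompt = list(range(48))            # 3 full 16-token blocks
cache.insert(system_prompt + [100, 101])   # first request fills the cache
# A second request sharing the 48-token prefix hits all three blocks:
print(cache.match(system_prompt + [200, 201, 202]))  # 48
```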
SGLang (Zheng et al., NeurIPS 2024) uses a radix tree (compressed prefix tree) to store cached KV caches. A radix tree is a trie where nodes with single children are merged, making it efficient for storing strings with shared prefixes.
Here is how RadixAttention differs from vLLM's block-level hashing:
vLLM approach: Hash each 16-token block independently. Look up the hash in a flat hash table. If found, reuse the cached KV block. This is fast but rigid -- matching only works at 16-token boundaries, and there is no structural relationship between cached entries.
RadixAttention approach: Maintain a tree where each edge represents a sequence of tokens. The tree captures the hierarchical structure of cached prefixes. When a new request arrives, traverse the tree from the root, matching tokens until a mismatch. The matched portion is a cache hit; only the unmatched suffix needs computation.
Example tree structure:
Root
├── "You are a helpful assistant..." (system prompt A)
│ ├── "Translate this to French: ..." (request 1)
│ ├── "Summarize the following: ..." (request 2)
│ └── "What is the capital of ..." (request 3)
└── "You are a coding expert..." (system prompt B)
├── "Write a Python function ..." (request 4)
└── "Debug this code: ..." (request 5)
All requests under system prompt A share the same KV cache for the system prompt. Requests 1, 2, and 3 only compute KV for their unique suffixes. If request 6 arrives with the same system prompt A and a new query, it finds the system prompt in the tree, reuses its KV cache, and only computes the query portion.
RadixAttention also enables cache-aware scheduling: the scheduler reorders the request queue to prioritize requests whose prefixes have the highest cache hit rates. This maximizes reuse and reduces total compute.
SGLang achieves 50-99% cache hit rates depending on the workload. For multi-call workloads (agents that make multiple LLM calls with shared context, such as tree-of-thought reasoning or multi-step tool use), RadixAttention delivers up to 5x throughput improvement. The gains are proportional to how much prefix sharing exists in the workload.
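The tree lookup can be sketched with a toy trie (character-level and uncompressed for brevity; RadixAttention stores token IDs and merges single-child chains into radix-tree edges):

```python
# Toy trie-based prefix lookup in the spirit of RadixAttention.
class Node:
    def __init__(self):
        self.children = {}
        self.kv = None  # stand-in for the cached KV of this prefix

def insert(root, seq):
    node = root
    for t in seq:
        node = node.children.setdefault(t, Node())
    node.kv = object()

def longest_cached_prefix(root, seq):
    """Walk the tree until the first mismatch; return matched length."""
    node, depth = root, 0
    for t in seq:
        if t not in node.children:
            break
        node, depth = node.children[t], depth + 1
    return depth

root = Node()
insert(root, "You are a helpful assistant. Translate this to French:")
insert(root, "You are a helpful assistant. Summarize the following:")
# A third request reuses the shared system-prompt prefix; only its
# unmatched suffix would need prefill compute:
n = longest_cached_prefix(root, "You are a helpful assistant. What is")
print(n == len("You are a helpful assistant. "))  # True
```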
Figure 9: Prefix caching with a radix tree. Requests sharing common prefixes (like system prompts) reuse cached KV blocks. Green nodes represent cache hits (skipped computation); red nodes represent cache misses (computed fresh).
| Workload | Cache Hit Rate | Benefit |
|---|---|---|
| Chatbot with system prompt | 90-99% | High (same prefix for every request) |
| Multi-turn conversation | 70-95% | High (previous turns are shared) |
| Agents / multi-step reasoning | 80-95% | Very high (same context, multiple calls) |
| RAG with few documents | 50-80% | Moderate (shared document context) |
| Diverse unique prompts | Under 10% | Minimal |
The overhead of prefix caching is negligible (small hash table or tree lookup). There is no reason not to enable it -- it either helps or does nothing.
With the techniques covered, the practical question becomes: which framework implements them best for your workload?
The serving framework ecosystem has matured rapidly. In 2022, you had two choices: HuggingFace Transformers (easy but slow) or FasterTransformer (fast but hard to use). Today, there are at least six production-quality options, and all of them support continuous batching, PagedAttention (or equivalent), quantization, and multi-GPU parallelism.
The differences are in the details: which workloads each framework optimizes for, how extensible they are, and which hardware they support.
Core innovation: PagedAttention. The first open-source engine to make KV cache memory management a first-class concern.
Best for: High-concurrency production deployments, teams migrating from OpenAI API.
Core innovation: RadixAttention for prefix caching, structured generation primitives.
Best for: Agents, RAG pipelines, structured output, prefix-heavy workloads.
Core innovation: Fused CUDA kernel compilation per model and input shape.
Best for: Latency-sensitive applications, maximum hardware utilization on NVIDIA GPUs.
Core innovation: CPU-first inference with aggressive quantization.
Best for: Local/edge deployment, prototyping, privacy-sensitive applications, running models on hardware you already own.
DeepSpeed-FastGen (Holmes et al., 2024): Microsoft's framework with Dynamic SplitFuse -- decomposes long prompts into chunks and composes short prompts to fill a target token budget. Integrated into Azure ML. Claims 2.3x throughput over vLLM in some scenarios.
TGI (Text Generation Inference): HuggingFace's serving solution. Good default for teams already in the HuggingFace ecosystem. Easy deployment, Flash-Decoding integration, reasonable performance without extensive tuning.
MLC LLM: Cross-platform compilation (CUDA, Metal, Vulkan, WebGPU). Compiles models for diverse hardware including phones and browsers. The choice when you need to run on non-NVIDIA hardware.
| Framework | Core Innovation | Hardware | Best For |
|---|---|---|---|
| vLLM | PagedAttention | NVIDIA, AMD | High-concurrency production |
| SGLang | RadixAttention | NVIDIA, AMD | Agentic / structured output |
| TensorRT-LLM | Fused CUDA kernels | NVIDIA only | Max single-request performance |
| llama.cpp | CPU-first quantization | CPU, NVIDIA, Apple, Vulkan | Local / edge deployment |
| TGI | HuggingFace integration | NVIDIA, AMD, Intel, TPU | Easy production deployment |
Figure 10: Serving framework comparison. The best fit depends on your hardware, workload type, and model size.
| If you need... | Use... | Why |
|---|---|---|
| High-concurrency API server | vLLM | Best throughput at scale, OpenAI-compatible |
| Agent/RAG workloads | SGLang | RadixAttention, prefix caching, structured output |
| Maximum latency optimization | TensorRT-LLM | Fused kernels, deepest hardware optimization |
| Run on laptop/phone | llama.cpp | CPU-first, no GPU required |
| Cross-platform (Metal, WebGPU) | MLC LLM | Compiles to any backend |
| Azure ML integration | DeepSpeed-FastGen | Native Microsoft stack |
| Quick start, HuggingFace models | TGI | Simplest deployment |
As far as I know, there is no single framework that dominates across all workloads. The field is moving too fast for any one system to maintain a permanent advantage. In late 2024 and early 2025, SGLang pulled ahead of vLLM on several benchmarks. vLLM responded with significant performance improvements. TensorRT-LLM remains fastest for latency-sensitive single-request scenarios. The leaderboard shifts every few months.
My practical advice: start with vLLM or SGLang (both are easy to set up and have OpenAI-compatible APIs). Here is a minimal setup for each:
# vLLM: 3 lines to a production-grade API server
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8B-Instruct \
--enable-prefix-caching
# Server is now running at http://localhost:8000
# Compatible with OpenAI Python client
# SGLang: similarly simple
pip install sglang[all]
python -m sglang.launch_server \
--model-path meta-llama/Llama-3-8B-Instruct \
--port 8000
Benchmark on your actual workload, and switch if you find a specific framework that is clearly better for your use case. Most production teams I have talked to evaluate 2-3 frameworks before committing.
Route easy requests to small models and hard requests to large models. If a 7B model can handle a simple factual question, why spend compute on a 70B model?
The challenge is determining difficulty before running the large model. Common approaches include training a lightweight classifier to predict query difficulty, and cascading: answer with the small model first and escalate to the large model when its confidence is low.
The economics are compelling. If 70% of requests can be handled by a 7B model (10x cheaper to serve), the average cost per request drops by ~60% with only a small quality reduction on the remaining 30% of hard queries. This is essentially what API providers do when they offer different model tiers.
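The cost arithmetic, made explicit (the 10x price ratio and 70% routing fraction are the assumptions stated above):

```python
# Routing economics: what fraction of the all-large-model cost do we
# pay if 70% of requests go to a model that is 10x cheaper?
small_cost, large_cost = 0.1, 1.0  # illustrative per-request unit costs
frac_small = 0.7                   # requests the small model can handle

avg_cost = frac_small * small_cost + (1 - frac_small) * large_cost
savings = 1 - avg_cost / large_cost
print(round(avg_cost, 2), round(savings, 2))  # 0.37 0.63
```

So the average request costs 37% of the all-large baseline -- the "~60% cheaper" figure above.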
Beyond quantization, you can selectively evict less important KV cache entries to free memory.
H2O (Heavy-Hitter Oracle) (Zhang et al., NeurIPS 2023): Observes that a small subset of tokens ("Heavy Hitters") receive disproportionate attention across layers and heads. These are typically content-bearing tokens (nouns, verbs, key entities) rather than function words.
H2O's eviction policy keeps two categories of tokens: the heavy hitters (tokens with the highest accumulated attention scores) and a window of the most recent tokens.
Everything else is evicted from the KV cache. With only 20% of tokens retained, generation quality degrades minimally on most tasks (within 1-2% on benchmarks), while throughput improves up to 29x because the attention computation and KV cache memory shrink proportionally.
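A sketch of the eviction decision (simplified to a single score vector; real H2O tracks accumulated attention per head and per layer, and updates the set online during generation):

```python
# H2O-style KV cache eviction: keep recent tokens plus the "heavy
# hitters" with the highest accumulated attention, evict the rest.
def h2o_keep(attn_scores, recent_window, budget):
    """attn_scores[i] = accumulated attention received by token i.
    Returns the sorted positions to retain in the KV cache."""
    n = len(attn_scores)
    recent = set(range(max(0, n - recent_window), n))
    # Rank older tokens by accumulated attention ("heavy hitters")
    older = sorted((i for i in range(n) if i not in recent),
                   key=lambda i: attn_scores[i], reverse=True)
    heavy = set(older[: max(0, budget - len(recent))])
    return sorted(recent | heavy)

scores = [5.0, 0.1, 3.2, 0.2, 0.1, 2.8, 0.3, 0.2]
# Keep 4 of 8 tokens: the 2 most recent plus the 2 heaviest hitters.
print(h2o_keep(scores, recent_window=2, budget=4))  # [0, 2, 6, 7]
```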
Other approaches to KV cache compression appear regularly -- this is one of the most active research areas in LLM serving, with new papers published monthly.
StreamingLLM (Xiao et al., ICLR 2024) discovered a surprising phenomenon: LLMs attend strongly to the first few tokens regardless of their semantic content. These "sink" tokens absorb excess attention mass.
Why does this happen? Softmax requires attention weights to sum to 1. When a token has no strong preference for any particular earlier token, the attention mass has to go somewhere. It gets dumped onto the earliest tokens as a mathematical artifact -- they serve as an "attention sink." If you evict these initial tokens from the sliding window (as naive window attention would), the attention distribution breaks and generation quality collapses.
StreamingLLM's fix is simple: always keep the first 4 tokens (the "sink" tokens) in the KV cache, plus a sliding window of the most recent N tokens. The cache has constant size: 4 + N tokens, regardless of how many tokens have been generated.
```text
Standard KV cache (grows forever):
  [tok_1, tok_2, ..., tok_1000, tok_1001, ..., tok_10000, ...]
  Memory grows without bound: eventually OOM

Sliding window (fails -- drops sinks):
  [..., tok_9997, tok_9998, tok_9999, tok_10000]
  Quality collapses once tok_1..tok_4 are evicted

StreamingLLM (constant memory):
  [tok_1, tok_2, tok_3, tok_4, ..., tok_9997, tok_9998, tok_9999, tok_10000]
   ↑ sink tokens (always kept)      ↑ sliding window (recent N tokens)
```
This enables handling 4M+ token streams with 22.2x speedup over sliding-window recomputation. The technique has been integrated into TensorRT-LLM and HuggingFace Transformers.
The limitation is important: this enables infinite generation, not infinite understanding. The model can only attend to the sink tokens and the recent window -- it has no access to tokens that have been evicted. For applications requiring true long-context understanding (like analyzing a 100K-token document), you still need the full KV cache.
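The sink-plus-window policy is simple enough to sketch in a few lines. This is an illustration of the eviction logic only, under my own naming; a real implementation stores per-layer K/V tensors, not token ids:

```python
from collections import deque

class StreamingKVCache:
    """
    Sketch of StreamingLLM's eviction policy: keep the first `n_sink`
    tokens forever, plus a sliding window of the most recent `window`
    tokens. Cache size is constant: n_sink + window entries.
    """
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sinks: list[int] = []
        self.recent: deque[int] = deque(maxlen=window)

    def append(self, token: int) -> None:
        if len(self.sinks) < self.n_sink:
            self.sinks.append(token)   # first n_sink tokens: attention sinks
        else:
            self.recent.append(token)  # deque evicts the oldest automatically

    def tokens(self) -> list[int]:
        return self.sinks + list(self.recent)
```

Feed it a million tokens and `tokens()` never exceeds `n_sink + window` entries, which is exactly why generation can run indefinitely at constant memory.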
The Mooncake approach is evolving into multi-tier KV cache hierarchies:
KV cache becomes a distributed cache service, with migration and replication policies similar to content delivery networks. This is where LLM serving infrastructure starts to look more like distributed systems research than ML research.
FP4 quantization: NVIDIA's Blackwell (B100/B200) GPUs support FP4 Tensor Cores, which could deliver up to 2x throughput over FP8 if quality retention proves sufficient. Early results show 96-98% quality retention.
CXL (Compute Express Link): Enables expanded memory pools by connecting CPU and GPU memory through a coherent interconnect. A CXL-attached DRAM pool could provide 10x more memory for KV caches at moderate bandwidth.
Specialized inference accelerators: Groq's LPU (Language Processing Unit) has demonstrated 500+ tokens/second per user in public demos (as of early 2025) by using an SRAM-only architecture (Static RAM: fast, small on-chip memory) that eliminates the HBM bottleneck entirely. Since decode is memory-bandwidth-bound and SRAM is roughly 10x faster than HBM, removing HBM from the critical path fundamentally changes the performance equation. Cerebras's WSE-3 takes a similar approach, fitting entire models in on-chip memory across its wafer-scale chip.
These accelerators have limitations -- cost per chip, limited flexibility, and smaller software ecosystems -- but they demonstrate that the memory-bound decode bottleneck is a hardware design problem, not just a software optimization problem.
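A back-of-the-envelope roofline shows why bandwidth is the lever. At batch size 1, every decoded token must stream all model weights from memory at least once, so tokens/second is bounded by bandwidth divided by weight bytes. The numbers below are illustrative (H100 SXM HBM3 is ~3.35 TB/s; the "10x faster SRAM" factor is taken from the paragraph above, not a specific product spec):

```python
def max_decode_tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """
    Roofline upper bound for batch-1 decode: each token requires reading
    all weights once, so tokens/s <= bandwidth / weight_bytes.
    (Ignores KV cache reads, which push the real number lower.)
    """
    return bandwidth_bytes_per_s / weight_bytes

GB, TB = 1e9, 1e12
weights_70b_fp16 = 140 * GB   # 70B params x 2 bytes each

hbm_bound = max_decode_tokens_per_s(weights_70b_fp16, 3.35 * TB)      # ~24 tok/s
sram_bound = max_decode_tokens_per_s(weights_70b_fp16, 10 * 3.35 * TB)  # ~240 tok/s
```

The ~24 tokens/second HBM ceiling for a 70B FP16 model on a single H100 is why batch-1 decode feels slow on even the best GPUs, and why a 10x bandwidth jump translates directly into a 10x per-user speedup.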
The trend: hardware and software co-design, where serving systems, model architectures, and accelerator hardware are designed together. GQA is a good early example: a model architecture change (fewer KV heads) that was motivated by serving efficiency. Future models will likely incorporate more serving-aware design decisions during training.
In short, the key takeaways from this article:
LLM serving is a memory management problem. The KV cache -- not the model weights -- is the bottleneck once you are serving concurrent requests. Every technique in this article ultimately manages KV cache memory.
The prefill-decode duality shapes everything. Prefill is compute-bound; decode is memory-bound. These phases want different batch sizes, different parallelism strategies, and even different hardware. Recognizing this duality is the key to understanding serving architecture.
Continuous batching + PagedAttention are table stakes. Any production serving system in 2026 should use both. Together they deliver 10-50x throughput over naive static batching -- the single biggest win comes from simply switching from HuggingFace Transformers to vLLM or SGLang.
Speculative decoding offers "free" speedup. The output distribution is mathematically identical to standard decoding. Use self-speculative methods (Medusa, EAGLE) if you want the speedup without maintaining a second model.
Prefill-decode disaggregation is becoming the default for large-scale deployment. The engineering overhead is real, but 2-7x more requests within SLO is hard to ignore.
Prefix caching should always be enabled. It is free throughput for any workload with shared structure -- and most real workloads do.
FP8 on H100 is the current sweet spot for quantization: 30-50% throughput gain at 99%+ quality. INT4 weights + FP8 KV cache for maximum memory efficiency.
No single serving framework dominates. vLLM for high concurrency, SGLang for agentic workloads, TensorRT-LLM for maximum hardware utilization, llama.cpp for local deployment.
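The quantization takeaway is worth making concrete. A rough sketch of the memory arithmetic, where the 100 GB FP16 KV cache budget is a hypothetical workload figure chosen for illustration:

```python
def serving_memory_gb(params_b: float, weight_bits: int,
                      kv_gb_fp16: float, kv_bits: int) -> float:
    """Approximate GPU memory (GB) for weights plus KV cache under quantization."""
    weight_gb = params_b * weight_bits / 8   # params in billions -> GB
    kv_gb = kv_gb_fp16 * kv_bits / 16        # scale the FP16 KV footprint
    return weight_gb + kv_gb

baseline = serving_memory_gb(70, 16, 100, 16)  # FP16 weights + FP16 KV: 240 GB
quantized = serving_memory_gb(70, 4, 100, 8)   # INT4 weights + FP8 KV: 85 GB
```

The same hardware budget now leaves roughly 2.8x more headroom, which translates directly into more concurrent KV caches and therefore more concurrent users.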
The speed of innovation in LLM serving has been remarkable. In just three years (2022-2025), the field went from naive static batching to sophisticated systems that borrow ideas from OS kernels, compilers, and distributed databases. For a framework to organize all of these techniques, see the next section.
After 13 sections of techniques, it helps to have a framework for organizing everything. Here is how I think about the LLM serving stack, from bottom to top:
Layer 1 -- Hardware: GPU memory hierarchy (HBM, SRAM, registers), interconnect topology (NVLink, InfiniBand, PCIe), specialized accelerators. Every optimization ultimately maps to a hardware capability.
Layer 2 -- Memory Management: PagedAttention, KV cache quantization, KV cache compression (H2O), prefix caching (RadixAttention). Controls how the scarce KV cache memory is allocated, shared, and reclaimed.
Layer 3 -- Execution Engine: Continuous batching, chunked prefill, FlashAttention kernels, fused CUDA kernels. Controls how computation is organized within a single forward pass.
Layer 4 -- Scheduling: Request admission control, preemption (swap vs. recompute), SLO-aware scheduling, cache-aware routing. Controls which requests run when.
Layer 5 -- Optimization: Quantization (weight, activation, KV cache), speculative decoding. Orthogonal techniques that reduce resource requirements.
Layer 6 -- Architecture: Disaggregated prefill/decode, parallelism strategy (TP, PP, EP), multi-tier KV cache hierarchy. Controls the overall system topology.
Layer 7 -- API: OpenAI-compatible endpoint, streaming responses, structured output constraints, multi-LoRA routing. The interface users and applications see.
When you encounter a new technique or paper, ask: which layer does it operate at? What resource constraint does it address? This mental model turns a collection of disparate tricks into a coherent system design.
For example:
The most impactful innovations tend to span multiple layers. A single-layer optimization (like a faster kernel) gives incremental gains. A system that co-optimizes across layers (like Mooncake combining memory management, scheduling, and architecture) gives multiplicative gains.
I strongly suspect the biggest advances in the next 2-3 years will come from this cross-layer co-optimization -- techniques that treat the serving stack as a single integrated system rather than a collection of independent components. The serving stack is converging with the OS kernel design community for good reason: both manage scarce shared resources under competing demands with strict performance constraints.
LLM serving has become its own subfield of systems research. The mental model above should help you navigate whatever comes next.
Remember those 20 concurrent users that crashed your 70B model? Now you know exactly why: each 4K-token request allocates roughly 2.5 GB of KV cache, and a naive allocator -- reserving worst-case context length for every request and fragmenting what remains -- multiplies that footprint until even 8×H100s run out of memory. The fix: PagedAttention eliminates fragmentation waste, continuous batching maximizes throughput, and speculative decoding generates faster without changing the output. The same $200K GPU cluster now handles 500+ concurrent users -- not because you threw more hardware at it, but because you understood where the memory actually goes.
Foundational Papers:
Core Serving Systems:
Speculative Decoding:
Disaggregation:
Quantization:
KV Cache Management:
Educational Resources:
Code: