
LLM Serving from Scratch: The Systems Behind Fast Inference

50 min read

You just deployed a 70B parameter LLM on 8 H100 GPUs. The model weights consume 140 GB of memory. You start serving requests and throughput looks great -- until you hit 20 concurrent users with 4K context windows, and suddenly your GPUs are out of memory. But wait: 20 users times 4K tokens is only 80K tokens. How can that consume more memory than the 70 billion parameter model itself?

The answer is the KV cache. And understanding it is the key to understanding everything about LLM serving.

I wrote this article because I kept running into the same pattern: teams deploy an LLM, hit a mysterious memory wall, and then scramble to figure out why their $200K GPU cluster can only handle a dozen users. The serving stack has matured enormously in the past three years -- from Orca's continuous batching in 2022 to Mooncake's disaggregated architecture winning Best Paper at FAST 2025 -- but most of the knowledge lives scattered across academic papers and framework docs. This article brings it together.

What you'll learn:

  • Why the KV cache -- not the model weights -- is the memory bottleneck in production serving
  • How continuous batching achieves 23x higher throughput than static batching
  • Why PagedAttention borrows OS virtual memory concepts to eliminate memory waste
  • How speculative decoding generates tokens 2-3x faster with mathematically identical outputs
  • Why leading systems separate prefill and decode onto different hardware
  • How to choose between vLLM, SGLang, TensorRT-LLM, and other frameworks
  • Emerging techniques: chunked prefill, attention sinks, KV cache compression

Prerequisites: Familiarity with the transformer attention mechanism ($\text{softmax}(QK^T/\sqrt{d})V$), autoregressive text generation (token-by-token), and basic GPU concepts (what HBM (High Bandwidth Memory) is, why memory bandwidth matters). Comfortable reading Python pseudocode.

Skip-ahead guide: If you already understand KV caching and memory management, jump to Section 5 (Request Scheduling) for the systems-level techniques. If you're evaluating serving frameworks, jump to Section 11 (Framework Comparison).


Part I: The Problem

1. The Serving Problem

1.1 Traditional ML Serving vs. LLM Serving

Serving an image classifier is simple. A request comes in, you run one forward pass, return the result, free memory. The GPU holds the model weights and a batch of inputs -- both fixed size. Memory is predictable.

LLM serving breaks every assumption that makes traditional serving easy:

| Property | Image Classifier | LLM |
|---|---|---|
| Forward passes per request | 1 | 100-10,000+ (one per output token) |
| Memory per request | Fixed (input size) | Growing (KV cache accumulates) |
| Request lifetime | Milliseconds | Seconds to minutes |
| State between steps | None | KV cache (potentially GBs) |
| Output length | Fixed | Unknown until end-of-sequence (EOS) token |

An image classifier processes a batch in one shot and frees memory. An LLM request generating 500 tokens holds GPU memory across 500 forward passes, and you cannot predict how many tokens it will generate before it starts. This makes LLM serving fundamentally a resource management problem, not just a compute problem.

1.2 The Autoregressive Bottleneck

Each token depends on all previous tokens. There is no way to parallelize generation within a single request -- token 5 cannot be generated until tokens 1-4 exist. Generating N tokens requires N serial forward passes.

This has a direct consequence: latency is proportional to output length. If each forward pass takes 30ms, generating 200 tokens takes 6 seconds. No amount of GPU compute changes this sequential dependency. You can make each step faster, but you cannot skip steps.

To put this in perspective: an image classifier processes a 224x224 image in a single 5ms forward pass. An LLM generating a 500-word response (~750 tokens) at 30ms per token takes 22.5 seconds -- 4,500x longer. And during those 22.5 seconds, the request's KV cache sits in GPU memory, blocking other requests from using that space.

This sequential nature also means the generation phase is embarrassingly serial within each request. If you have a GPU with 989 TFLOPS of compute (H100), most of that compute power sits idle during single-request decode because the work per step is so small. A single matrix-vector multiply for a 70B model is about 140 billion FLOPs -- which an H100 completes in less than 0.2ms. Loading 140 GB of weights at 3.35 TB/s takes about 42ms. In practice, with partial overlap and optimization, each step takes roughly 30ms -- the vast majority of which is spent waiting for memory.

(The one exception is speculative decoding, which we cover in Section 8. It generates multiple tokens per step by cleverly guessing ahead.)

1.3 Prefill vs. Decode: Two Very Different Workloads

Here is the most underappreciated fact about LLM serving: a single serving request involves two fundamentally different computations.

A GPU has two finite resources: compute units (measured in TFLOPS) and memory bandwidth (measured in TB/s). An operation is compute-bound when the compute units are the bottleneck -- the GPU finishes reading data faster than it can crunch the numbers. An operation is memory-bound when the memory bandwidth is the bottleneck -- the compute units sit idle waiting for data to arrive from HBM. Which regime you are in depends on the operation's arithmetic intensity: the ratio of compute operations to bytes transferred. For an H100, the crossover point is about 295 FLOPs/byte (989 TFLOPS / 3.35 TB/s). Operations above this ratio are compute-bound; below it, memory-bound.
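The crossover arithmetic above can be sketched in a few lines. The H100 figures (989 TFLOPS, 3.35 TB/s) come from the text; the two example workloads are illustrative, using the 70B-model numbers from Section 1.2.

```python
def bottleneck(flops, bytes_moved, peak_tflops=989, bandwidth_tbps=3.35):
    """Classify an operation as compute- or memory-bound on a given GPU."""
    crossover = (peak_tflops * 1e12) / (bandwidth_tbps * 1e12)  # ~295 FLOPs/byte
    intensity = flops / bytes_moved                             # arithmetic intensity
    return "compute-bound" if intensity >= crossover else "memory-bound"

# Decode step for a 70B model: ~140 GFLOPs of matrix-vector work,
# but it must stream all 140 GB of FP16 weights (280e9 bytes) from HBM.
print(bottleneck(flops=140e9, bytes_moved=140e9 * 2))         # memory-bound

# Prefill of a 2048-token prompt: the same weights are loaded once,
# but roughly 2048x the compute is performed against them.
print(bottleneck(flops=140e9 * 2048, bytes_moved=140e9 * 2))  # compute-bound
```

The decode step sits at ~0.5 FLOPs/byte, two orders of magnitude below the crossover, which is why batching (Section 3) matters so much.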

Prefill (processing the input prompt): The model processes all input tokens in parallel. This is a large matrix-matrix multiplication. The GPU's compute cores are busy. The bottleneck is FLOPS -- this is compute-bound.

Decode (generating output tokens): The model generates one token at a time. Each step is a matrix-vector multiplication (one new token times the weight matrices). The bottleneck is loading model weights from HBM for each tiny computation -- this is memory-bandwidth-bound.

| Characteristic | Prefill | Decode |
|---|---|---|
| Computation type | Matrix-matrix multiply | Matrix-vector multiply |
| Bottleneck | Compute (FLOPS) | Memory bandwidth (GB/s) |
| Arithmetic intensity | High (~100+ FLOPs/byte) | Low (~1-10 FLOPs/byte) |
| Tokens processed | Hundreds to thousands at once | One at a time |
| GPU utilization | 60-80% | 5-30% |
| Ideal batch size | Small (1-4) to saturate compute | Large (64-256+) to amortize weight loading |

In my experience, this dual nature of LLM serving is the single most underappreciated aspect by teams new to deployment.

Prefill wants few requests with lots of tokens each; decode wants many requests to amortize the cost of loading weights. A single serving system must handle both simultaneously, and that tension drives most of the architectural decisions in this article.

1.4 The Memory Equation

GPU memory during LLM serving is consumed by four things:

$$\text{GPU Memory} = \text{Model Weights} + \text{KV Cache} + \text{Activations} + \text{Framework Overhead}$$

For a Llama 3 70B model in FP16:

  • Model weights: $70\text{B} \times 2\text{ bytes} = 140\text{ GB}$
  • KV cache for one request at 4K context: ~1.25 GB (we will derive this in Section 2)
  • KV cache for one request at 128K context: ~40 GB
  • Activations: ~1-2 GB (proportional to batch size, but small relative to KV cache)
  • Framework overhead: ~1-5 GB (CUDA context, memory allocator, etc.)

The fundamental tension: on 8 H100s (640 GB total HBM), after storing the 140 GB model, you have ~500 GB for KV caches. At 4K context, that supports ~400 concurrent requests. At 128K context, that supports ~12.

Let me put this in dollar terms. An 8xH100 node costs roughly $25,000/month to rent from a cloud provider. At 128K context, you can serve 12 concurrent users -- that is about $2,000/month per concurrent user slot, just for the hardware. At 4K context, it is about $62/month per slot. Long context is expensive not because the model is bigger, but because each user's KV cache is bigger.

This is why I say LLM serving is a memory management problem, not a compute problem. The KV cache is the scarce resource that every technique in this article ultimately manages.

LLM Serving Memory Calculator

[Interactive calculator. Example configuration shown: weights 130.4 GB + KV cache 40.0 GB + activations 8.0 GB = 178.9 GB used of 640.0 GB available (fits in memory). KV cache per request: 2 × 80 layers × 8 kv_heads × 128 head_dim × 2 B × 4,096 tokens ≈ 1.3 GB; max batch size: 339 requests.]
Figure 1: The LLM serving memory equation. Adjust model size, context length, batch size, and precision to see how KV cache grows to dominate GPU memory. The stacked breakdown shows weights (fixed) vs. KV cache (variable) vs. other overhead.


2. The KV Cache: A Systems Perspective

My attention optimization article explains how KV caching works algorithmically -- why caching keys and values avoids redundant computation and reduces generation from O(N^2) to O(N) total work. Here, we examine the KV cache from a different angle: its systems implications. How much memory does it actually consume? Why does naive management waste 60-80% of that memory? And how does this shape every architectural decision in a serving system?

2.1 Why the KV Cache Exists (Quick Review)

During autoregressive generation, the model attends to all previous tokens at each step. Without caching, this means recomputing the key (K) and value (V) projections for every previous token at every step -- O(N^2) total computation for N tokens.

The KV cache stores previously computed K and V tensors, so each step only computes K and V for the single new token and appends them to the cache. This reduces total computation to O(N). For details, see the attention optimization article.

The trade-off is memory. Every token in the cache consumes memory, and that memory is held for the request's entire lifetime. For a 70B model with 80 layers, each cached token requires:

$$2 \times 80 \times 8 \times 128 \times 2 = 327{,}680 \text{ bytes} \approx 320\text{ KB per token}$$

A 4K-token conversation holds 1.25 GB of KV cache. A 128K-token document analysis holds 40 GB. And this is per request -- multiply by your concurrent request count to get the total memory pressure.

2.2 KV Cache Memory Arithmetic

The KV cache stores two tensors (K and V) per layer, per token, for the KV heads:

$$\text{KV Memory} = 2 \times L \times n_{\text{kv}} \times d_h \times b_{\text{prec}} \times S \times B$$

Where $L$ = number of layers, $n_{\text{kv}}$ = number of KV heads, $d_h$ = head dimension, $b_{\text{prec}}$ = bytes per element (2 for FP16), $S$ = sequence length, $B$ = batch size.

Let me compute this for real models:

def kv_cache_memory_gb(layers, kv_heads, head_dim, seq_len, batch_size=1, precision_bytes=2):
    """KV cache memory in GB for a single request or batch."""
    bytes_total = 2 * layers * kv_heads * head_dim * precision_bytes * seq_len * batch_size
    return bytes_total / (1024**3)

# Llama 3 8B: 32 layers, 8 KV heads (GQA), d_h=128
print(f"Llama 3 8B  @ 4K:   {kv_cache_memory_gb(32, 8, 128, 4096):.2f} GB")
print(f"Llama 3 8B  @ 128K: {kv_cache_memory_gb(32, 8, 128, 131072):.2f} GB")

# Llama 3 70B: 80 layers, 8 KV heads (GQA), d_h=128
print(f"Llama 3 70B @ 4K:   {kv_cache_memory_gb(80, 8, 128, 4096):.2f} GB")
print(f"Llama 3 70B @ 128K: {kv_cache_memory_gb(80, 8, 128, 131072):.2f} GB")

# Llama 3 405B: 126 layers, 8 KV heads (GQA), d_h=128
print(f"Llama 3 405B @ 4K:  {kv_cache_memory_gb(126, 8, 128, 4096):.2f} GB")
print(f"Llama 3 405B @ 128K:{kv_cache_memory_gb(126, 8, 128, 131072):.2f} GB")

| Model | Layers | KV Heads | KV per request @ 4K | KV per request @ 128K |
|---|---|---|---|---|
| Llama 3 8B | 32 | 8 (GQA) | 0.50 GB | 16 GB |
| Llama 3 70B | 80 | 8 (GQA) | 1.25 GB | 40 GB |
| Llama 3 405B | 126 | 8 (GQA) | 1.97 GB | 63 GB |

Notice something: Llama 3 70B has 64 attention heads but only 8 KV heads. This is Grouped-Query Attention (GQA, Ainslie et al., 2023), where groups of query heads share key-value heads. Without GQA, the 70B model would use 64 KV heads and the cache would be 8x larger -- 10 GB per request at 4K context. GQA is not just a training trick; it is a serving requirement. Multi-Query Attention (MQA, Shazeer, 2019) takes this further, sharing a single KV head across all query heads.

2.3 The Batch Size vs. Context Length Tension

On 8 H100s (640 GB total HBM), a Llama 3 70B model in FP16 uses 140 GB for weights, leaving ~500 GB for KV caches (ignoring activations and overhead for simplicity).

  • At 4K context: $500 \div 1.25 = 400$ concurrent requests
  • At 32K context: $500 \div 10 = 50$ concurrent requests
  • At 128K context: $500 \div 40 \approx 12$ concurrent requests

This is the fundamental trade-off: throughput vs. context length. You cannot have both. The relationship is a strict inverse: double the context length, halve the maximum batch size.

This explains why long-context applications (128K-1M token contexts) are so expensive to serve. A user asking questions about a long document consumes 30-100x more KV cache memory than a chatbot conversation. And unlike model weights, which are shared across all requests, each request's KV cache is unique and cannot be shared (except through prefix caching, Section 10).

Here is how this plays out in terms of throughput. A serving system with batch size B generates B tokens per step. If each step takes 30ms:

  • At 4K context (batch 400): $400 \div 0.03 \approx 13{,}300$ tokens/second
  • At 128K context (batch 12): $12 \div 0.03 = 400$ tokens/second

That is a 33x throughput reduction from 4K to 128K context. The model's compute cost per token is the same -- all the lost throughput comes from memory constraints.
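The batch-size and throughput arithmetic above, as a small sketch. The inputs are the figures used in this section (a 500 GB KV budget, the per-request cache sizes from Section 2.2, a 30ms decode step); the function itself is illustrative.

```python
def decode_throughput(kv_budget_gb, kv_per_request_gb, step_seconds=0.03):
    """(max batch size, tokens/second) when batch size is capped by KV memory."""
    max_batch = int(kv_budget_gb // kv_per_request_gb)
    return max_batch, max_batch / step_seconds

# Llama 3 70B on 8xH100: 500 GB left for KV cache after weights
for context, kv_gb in [("4K", 1.25), ("32K", 10.0), ("128K", 40.0)]:
    batch, tps = decode_throughput(500, kv_gb)
    print(f"{context:>4} context: batch {batch:>3}, ~{tps:,.0f} tokens/s")
```

Both the batch size and the throughput fall linearly as per-request cache size grows; the GPU's compute is untouched either way.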

Every system in this article grapples with this constraint. Techniques like quantization (Section 7) shrink the cache to push the boundary; PagedAttention (Section 4) eliminates waste so you can fill the boundary more efficiently; disaggregation (Section 9) lets you use different hardware for the two phases; and prefix caching (Section 10) avoids recomputing shared prefixes entirely.

2.4 Why Naive KV Cache Management Wastes Memory

The simplest approach: when a request arrives, pre-allocate a contiguous chunk of GPU memory for its KV cache at the maximum possible sequence length. When the request finishes, free the chunk.

This wastes memory in three ways:

  1. Reservation waste: You allocate for max_seq_len (say 2048 tokens) even if the request only generates 50 tokens. Memory for 1998 unused token positions sits idle.

  2. Internal fragmentation: Even within the allocated region, the cache only fills as tokens are generated. A request that will eventually use 500 tokens starts with 499 empty slots.

  3. External fragmentation: As requests arrive and depart, the free memory becomes a patchwork of non-contiguous holes. A new request needing 1 GB of contiguous memory might fail even if 2 GB is free -- because it is in 50 scattered 40 MB chunks.

The vLLM paper (Kwon et al., 2023) measured this: existing systems waste 60-80% of KV cache memory to fragmentation. That means on our 500 GB budget, only 100-200 GB is actually used for live tokens. This waste directly limits throughput.

The solution is PagedAttention (Section 4), which borrows virtual memory concepts from operating systems to eliminate nearly all of this waste.


Part II: Core Solutions

3. Batching Strategies

3.1 Why Batching Matters

During decode, each request generates one token per step. The computation is a single matrix-vector multiply: the model weights times one token embedding. On an H100 with 3.35 TB/s bandwidth and 989 TFLOPS, the balance point is ~295 FLOPs/byte. A matrix-vector multiply has arithmetic intensity of about 1-2 FLOPs/byte -- the GPU loads the full weight matrices and barely uses them.

GPU utilization during single-request decode is often below 5%. The hardware sits idle, waiting for data to move.

Batching multiple requests converts memory-bound matrix-vector multiplies into compute-bound matrix-matrix multiplies. If you batch 64 requests, the GPU loads the weight matrices once and multiplies them by 64 tokens simultaneously. Same data transfer, 64x more useful compute. Throughput scales nearly linearly with batch size -- until you run out of KV cache memory.

3.2 Static Batching (The Naive Approach)

The simplest batching strategy: collect B requests, process them as a batch, wait until all finish. This is how traditional ML serving works -- it processes a batch of images, returns all results, and moves on.

For LLMs, the problem is variable-length generation. Unlike image classification where every input produces one output, LLM requests generate different numbers of tokens. Consider a batch of 8 requests with output lengths [10, 15, 20, 50, 100, 150, 200, 500]:

  • After 10 steps: 1 request is done. 7 are still generating.
  • After 50 steps: 4 requests are done. 4 are still generating. The 4 finished requests waste their batch slots -- their KV caches sit in memory, their positions in the batch matrix are filled with padding tokens that consume compute but produce nothing.
  • After 500 steps: The last request finishes. The batch slot of the first request was wasted for 490 of 500 steps.

Let me quantify this. Total useful computation: $10 + 15 + 20 + 50 + 100 + 150 + 200 + 500 = 1{,}045$ token-steps. Total computation performed: $8 \times 500 = 4{,}000$ token-steps (every slot runs for all 500 steps). Average GPU utilization: $1{,}045 / 4{,}000 \approx 26\%$. And this is just compute waste -- the KV cache memory for finished requests is also wasted.

From the user's perspective: the person who asked a simple yes/no question (10 tokens) waited for the person who asked for a 500-word essay. Both get their responses at the same time, despite the 50x difference in output length.
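A quick sketch of the utilization arithmetic above: with static batching, every slot is held until the longest request in the batch finishes.

```python
def static_batch_utilization(output_lengths):
    """Fraction of token-steps that produce useful tokens under static batching."""
    useful = sum(output_lengths)                       # tokens actually generated
    total = len(output_lengths) * max(output_lengths)  # slots x longest request
    return useful / total

lengths = [10, 15, 20, 50, 100, 150, 200, 500]
print(f"{static_batch_utilization(lengths):.0%}")  # prints 26%
```

With uniform lengths the function returns 1.0, which is why fixed-format workloads see little benefit from continuous batching.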

3.3 Continuous Batching (Orca, 2022)

The Orca paper (Yu et al., OSDI 2022) introduced the key insight: schedule at the iteration level, not the request level. When a request finishes (emits an EOS token), immediately replace it with a waiting request. No padding, no waiting.

Here is the iteration-level scheduling loop:

# Simplified continuous batching scheduler
from collections import deque

active_batch = []
waiting_queue = deque()

while True:
    # Step 1: Run one forward pass on the active batch
    outputs = model.forward(active_batch)

    # Step 2: Retire completed requests (iterate over a copy,
    # since removing while iterating would skip elements)
    for request in list(active_batch):
        if request.last_token == EOS:
            send_response(request)
            active_batch.remove(request)

    # Step 3: Fill empty slots with waiting requests
    while len(active_batch) < max_batch_size and waiting_queue:
        new_request = waiting_queue.popleft()
        run_prefill(new_request)  # Process input prompt
        active_batch.append(new_request)

The batch is always full (or as full as the queue allows). GPU compute is never wasted on finished requests.

An important subtlety: different requests in the batch may be at very different points in their generation. Request A might be 500 tokens into generation while Request B just started. They have different KV cache lengths, which means the attention computation differs per request. Orca handles this with selective batching: operations that depend on sequence length (attention layers) process each request individually, while operations that are independent of position (MLP layers, layer norms) batch across all requests.

Anyscale reported 23x throughput improvement over static batching in a 2023 blog post. The gain comes from two sources: (1) no wasted compute on finished requests, and (2) the batch stays full, maximizing GPU utilization during the memory-bound decode phase. The actual number depends on the variance in output lengths -- uniform lengths (like fixed-format tasks) show smaller gains, while highly variable lengths (like open-ended chat) show larger gains.

Static vs. Continuous Batching

[Interactive figure: 8 requests, batch size 4; darker shade = prefill, lighter = decode, empty = wasted GPU cycles, gray = padding after a request finishes. Static batching takes 28 time steps at 50% GPU utilization; continuous batching finishes the same requests in 16 time steps at 88% utilization.]
Figure 2: Static vs. continuous batching. In static batching (top), finished requests waste their slots until the entire batch completes. In continuous batching (bottom), finished requests are immediately replaced, keeping the GPU fully utilized.

3.4 The Prefill Problem and the "Prefill Bubble"

Continuous batching has a subtle problem. When a new request enters the batch, its prompt must be prefilled -- all input tokens processed in a single forward pass. Prefill is compute-heavy: a 2000-token prompt generates 2000 KV pairs in one step.

During this prefill, all other requests in the batch must wait. Their decode steps (generating one token each) are blocked by the long prefill operation. This creates what the Sarathi-Serve paper (Agrawal et al., OSDI 2024) calls the "prefill bubble" -- a stall in decode throughput caused by a single request's prefill.

The concrete impact: if prefilling a 2000-token prompt takes 80ms and your decode TPOT (time per output token) target is 30ms, every request in the batch misses its TPOT deadline during that prefill step. With frequent new request arrivals, decode requests experience unpredictable latency spikes.

This motivates two solutions:

  • Chunked prefill (Section 3.5) -- break prefills into smaller pieces and interleave them with decode steps
  • Prefill-decode disaggregation (Section 9) -- run prefill and decode on entirely separate hardware

3.5 Chunked Prefill (Sarathi / Sarathi-Serve)

Chunked prefill, introduced by SARATHI (Agrawal et al., 2023), splits long prefill prompts into fixed-size chunks (say, 512 tokens) and processes one chunk per iteration alongside the ongoing decode requests.

The key idea is decode-maximal batching: each iteration contains one prefill chunk plus as many decode requests as will fit. The prefill chunk saturates the GPU's compute capacity, and the decode requests "piggyback" at almost no additional cost because decode is memory-bound and the compute is free.

Without chunked prefill:
  Step 1: [Prefill 2000 tokens] ← decode requests BLOCKED
  Step 2: [Decode 64 requests]
  Step 3: [Decode 64 requests]

With chunked prefill (chunk_size=512):
  Step 1: [Prefill chunk 1 (512 tokens)] + [Decode 64 requests]
  Step 2: [Prefill chunk 2 (512 tokens)] + [Decode 64 requests]
  Step 3: [Prefill chunk 3 (512 tokens)] + [Decode 64 requests]
  Step 4: [Prefill chunk 4 (512 tokens)] + [Decode 64 requests]

Decode requests are never blocked. The prefill takes 4 steps instead of 1, but the total system throughput is higher because decode never stalls. Sarathi-Serve extended this with stall-free scheduling, achieving up to 5.6x higher serving capacity on Falcon-180B compared to vLLM.

Chunked prefill is now standard. Both vLLM and SGLang have adopted it -- this is not an experimental technique.
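The decode-maximal batching loop above can be sketched as follows. The per-iteration token budget, chunk size, and `Request` class are illustrative assumptions, not any framework's actual API.

```python
from collections import deque
from dataclasses import dataclass

CHUNK = 512          # prefill tokens admitted per iteration
TOKEN_BUDGET = 768   # total tokens processed per iteration

@dataclass
class Request:
    prompt_len: int
    prefilled: int = 0

def build_iteration(prefill_queue, decode_batch):
    """One iteration: at most one prefill chunk, then fill with decode requests."""
    batch, budget = [], TOKEN_BUDGET
    if prefill_queue:
        req = prefill_queue[0]
        chunk = min(CHUNK, req.prompt_len - req.prefilled, budget)
        req.prefilled += chunk
        budget -= chunk
        batch.append(("prefill", req, chunk))
        if req.prefilled == req.prompt_len:   # prefill done: request moves to decode
            prefill_queue.popleft()
            decode_batch.append(req)
    # Each decode request contributes exactly one token per iteration.
    for req in decode_batch[:budget]:
        batch.append(("decode", req, 1))
    return batch
```

Running this on a 2000-token prompt spreads the prefill over four iterations (512 + 512 + 512 + 464 tokens), with decode requests piggybacking on each one.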


4. Memory Management and PagedAttention

4.1 The Memory Fragmentation Problem

As we saw in Section 2.4, naive contiguous allocation wastes 60-80% of KV cache memory. Let me make this concrete.

Consider 8 requests on an 80 GB GPU. You pre-allocate 2048 token positions per request. Actual token counts at a given moment: [128, 256, 512, 1024, 156, 300, 890, 1800].

  • Total pre-allocated: $8 \times 2048 = 16{,}384$ token positions
  • Total actually used: $128 + 256 + 512 + 1024 + 156 + 300 + 890 + 1800 = 5{,}066$ positions
  • Wasted: 69%
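The reservation-waste numbers above, as a quick check:

```python
# Token counts actually in use per request, vs. a 2048-slot pre-allocation each
used = [128, 256, 512, 1024, 156, 300, 890, 1800]
reserved = len(used) * 2048
waste = 1 - sum(used) / reserved
print(f"reserved {reserved} slots, used {sum(used)}, wasted {waste:.0%}")
```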

And this is the good case -- when requests fit neatly into memory. The real problem comes when a 9th request arrives and needs a 2048-position contiguous block, but the free memory is scattered across dozens of non-contiguous gaps left by completed requests.

4.2 The OS Analogy: Virtual Memory

Operating systems faced this exact problem decades ago. Early computers allocated contiguous physical memory to each process. When processes terminated, memory became fragmented. The solution -- invented in the 1960s -- was virtual memory with paging.

The idea: give each process a virtual address space that looks contiguous. Behind the scenes, the OS maps virtual pages to physical frames scattered anywhere in RAM. A page table tracks the mapping. The process thinks it has one big contiguous block; physical memory can be in a hundred pieces.

PagedAttention applies this same idea to KV caches.

4.3 PagedAttention: How It Works

PagedAttention (Kwon et al., SOSP 2023) divides KV cache memory into fixed-size blocks -- typically 16 tokens per block. Each request gets a block table that maps logical positions to physical GPU memory blocks.

Here is how it works step by step:

  1. A request arrives. The system allocates blocks for its initial prefill, one at a time, from a free list.
  2. As the request generates tokens, new blocks are allocated on demand when the current block fills up.
  3. Each request's block table maps logical block indices (0, 1, 2, ...) to physical block addresses in GPU memory.
  4. The attention kernel reads from physical blocks using the block table for indirection -- the KV cache does not need to be contiguous.
  5. When a request finishes, all its blocks return to the free list.

# Simplified PagedAttention block allocation
class PagedKVCacheManager:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # Pool of physical blocks
        self.block_tables = {}  # request_id -> [physical_block_ids]

    def allocate(self, request_id):
        """Allocate a new block for a request."""
        if not self.free_blocks:
            return None  # Out of memory -- trigger preemption
        block = self.free_blocks.pop()
        self.block_tables.setdefault(request_id, []).append(block)
        return block

    def free(self, request_id):
        """Free all blocks belonging to a finished request."""
        blocks = self.block_tables.pop(request_id, [])
        self.free_blocks.extend(blocks)  # Return blocks to pool

Let me walk through a concrete example. Say we have GPU memory divided into 1000 blocks of 16 tokens each (16,000 token positions total). Three requests arrive:

Request A (prompt: 100 tokens, generating): Needs ceil(100/16) = 7 blocks initially. The block table maps logical blocks [0, 1, 2, 3, 4, 5, 6] to physical blocks [42, 817, 3, 291, 550, 12, 666]. These physical blocks can be anywhere in GPU memory -- they do not need to be contiguous.

Request B (prompt: 50 tokens, generating): Gets 4 blocks: logical [0, 1, 2, 3] mapped to physical [100, 201, 99, 500].

Request A finishes: Its 7 blocks return to the free list. Physical blocks [42, 817, 3, 291, 550, 12, 666] are now available for any new request.

Request C (prompt: 200 tokens) arrives: Gets 13 blocks from the free list. Some might be from Request A's freed blocks, some from blocks that were always free. The request does not care -- it sees a contiguous logical address space.
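The A/B/C walkthrough above can be run against a compact, self-contained re-sketch of the block manager (the 1000-block pool and request sizes are from the example; the physical block IDs the pool hands out will differ):

```python
from math import ceil

BLOCK = 16
free = list(range(1000))   # 1000 physical blocks of 16 tokens each
tables = {}                # request_id -> [physical block ids]

def admit(rid, num_tokens):
    """Allocate enough blocks from the free list for a request's tokens."""
    tables[rid] = [free.pop() for _ in range(ceil(num_tokens / BLOCK))]

def release(rid):
    """Return a finished request's blocks to the pool."""
    free.extend(tables.pop(rid))

admit("A", 100)                    # ceil(100/16) = 7 blocks
admit("B", 50)                     # ceil(50/16)  = 4 blocks
assert len(tables["A"]) == 7 and len(tables["B"]) == 4

release("A")                       # A's 7 blocks go back to the pool
admit("C", 200)                    # ceil(200/16) = 13 blocks, some reusing A's
assert len(tables["C"]) == 13
assert len(free) == 1000 - 4 - 13  # only live requests hold blocks
```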

The attention kernel in vLLM uses the block table to look up where each logical block lives in physical memory. This adds a level of indirection (one table lookup per block), but the overhead is minimal compared to the memory savings.

Near-zero waste: The only wasted memory is in the last block of each request, which may not be full. With 16 tokens per block, the average waste per request is 8 tokens -- less than 4% for any reasonable sequence length. For a request with 1000 tokens, that is 8/1000 = 0.8% waste.

External fragmentation eliminated: Blocks are scattered in physical memory by design. A new request can always use available blocks regardless of their physical locations. There is no concept of "not enough contiguous space."

Contiguous vs. Paged KV Cache Allocation

[Interactive figure: 24 memory blocks, 4 tokens/block; three requests at 12/32, 6/32, and 20/32 tokens. Contiguous allocation: 10 blocks used, 14 wasted, 42% efficiency. Paged allocation: 10 blocks used, 14 free, ~85% efficiency.]
Key insight: Contiguous allocation pre-reserves max_seq_len for every request, wasting blocks on tokens not yet generated. Paged allocation only uses blocks for tokens that exist, and blocks can be anywhere in physical memory.

Figure 3: PagedAttention memory management. Each request's KV cache is stored in non-contiguous blocks mapped through a block table. When requests finish, their blocks return to the free list immediately. Compare this to contiguous allocation, where fragmentation accumulates over time.

4.4 Copy-on-Write for Parallel Sampling

Beam search and parallel sampling generate multiple output sequences from the same prompt. Without PagedAttention, you must duplicate the prompt's KV cache for each sequence -- a huge waste since they share the same prefix.

PagedAttention supports copy-on-write (another OS concept): all sequences sharing a prefix point to the same physical blocks. Only when a sequence diverges (writes a different token to a position) is the affected block copied. The vLLM paper reports up to 55% memory savings for beam search.
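A hedged sketch of how the copy-on-write bookkeeping might look: shared prefix blocks carry a reference count, and a write to a shared block first drops the shared reference and copies into a fresh block. Block contents are elided; this illustrates the concept, not vLLM's actual implementation.

```python
class CowBlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}              # physical block id -> reference count

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        """A new sequence points at an existing prefix block."""
        self.refcount[block] += 1
        return block

    def write(self, block):
        """Return the block to write to, copying first if it is shared."""
        if self.refcount[block] == 1:
            return block                # sole owner: write in place
        self.refcount[block] -= 1       # drop our shared reference...
        return self.allocate()          # ...and copy into a fresh private block

pool = CowBlockPool(8)
prefix = pool.allocate()    # prompt KV block, refcount 1
pool.share(prefix)          # a second beam shares it, refcount 2
b = pool.write(prefix)      # second beam diverges: gets a private copy
assert b != prefix and pool.refcount[prefix] == 1
```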

I think PagedAttention is one of the most elegant ideas in recent systems research -- taking a 50-year-old OS concept and applying it to a completely new domain. The insight is that KV caches have the same properties as process memory: variable-length, dynamic, and fragmentation-prone. Once you see the analogy, the solution follows naturally.

4.5 Performance Results

vLLM with PagedAttention achieves:

  • 2-4x throughput over state-of-the-art systems like FasterTransformer and Orca
  • >96% memory utilization (vs. 20-40% with contiguous allocation)

These numbers are from the original paper. In practice, the gains are largest when serving many concurrent requests with variable-length outputs -- exactly the common production scenario.


5. Request Scheduling

5.1 The Scheduling Problem

Traditional web servers are stateless: each request arrives, gets processed, and the server forgets about it. LLM requests are different -- each one holds a KV cache in GPU memory for its entire lifetime. Admitting a new request means committing memory that will be occupied for seconds or minutes.

The scheduler must balance four competing objectives:

  • Throughput: Process as many tokens per second as possible
  • TTFT (time to first token): Get first tokens to users quickly
  • TPOT (time per output token): Keep the per-token generation speed steady
  • Fairness: Don't starve long-running requests

5.2 FCFS vs. Shortest-Job-First

First-Come-First-Served (FCFS): Process requests in arrival order. Simple, fair, but suffers from head-of-line blocking. A long request (say, "write me a 5000-word essay") delays all subsequent short requests ("translate this sentence").

Shortest-Job-First (SJF): Prioritize short requests to minimize average latency. The problem: we do not know the output length ahead of time. The user's prompt gives some hints (a translation request is probably shorter than an essay request), but the actual output length is only known when the model emits EOS.

In practice, systems use FCFS as the baseline with two escape valves: preemption (evict a running request when memory is full) and priority scheduling (bump certain requests ahead of others based on heuristics or business logic).
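A toy single-queue simulation makes the head-of-line-blocking numbers concrete. The job lengths here are illustrative, all jobs are assumed to arrive at once, and SJF is given the lengths up front -- exactly the information LLMs cannot provide:

```python
# FCFS vs. SJF on one serving slot (toy model, known job lengths).

def avg_wait(jobs):
    """jobs: list of processing times, executed in the given order."""
    t, waits = 0, []
    for length in jobs:
        waits.append(t)      # time this job spends waiting before it starts
        t += length
    return sum(waits) / len(jobs)

jobs = [12, 3, 5, 2, 7]                 # all arrive at t=0
print(avg_wait(jobs))                   # FCFS order: 13.8
print(avg_wait(sorted(jobs)))           # SJF order:   6.8
```

Running the 12-step job first makes every later request eat that delay; running it last halves the average wait. This is the head-of-line blocking that preemption works around.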

5.3 Preemption Strategies

When a new high-priority request arrives and GPU memory is full, the scheduler has two options:

Option 1: Swapping. Move a lower-priority request's KV cache to CPU memory via PCIe (the standard CPU-to-GPU bus). When the request can resume, copy it back. The cost:

  • Llama 3 70B, 4K context: 1.25 GB over PCIe Gen4 (~25 GB/s effective unidirectional) takes ~50ms each way
  • Llama 3 70B, 32K context: 10 GB = ~400ms each way
  • Swapping preserves all computation -- no work is redone
  • Requires CPU memory to store the swapped cache

Option 2: Recomputation. Discard the lower-priority request's KV cache entirely. When it resumes, re-run its prefill from scratch. The cost:

  • 2K-token prompt on Llama 3 70B: ~40-80ms (compute-bound)
  • 32K-token prompt: ~600-1200ms (much more expensive)
  • No CPU memory needed
  • The generated tokens themselves are kept; only the KV cache is discarded, and the resumed prefill re-runs over the prompt plus the tokens generated so far

The decision is straightforward: for short contexts (under ~4K tokens), recomputation is usually faster because the prefill is cheap and you avoid the round-trip transfer. For long contexts (over ~8K tokens), swapping wins because recomputation is expensive. Between 4K and 8K, it depends on your PCIe bandwidth and GPU compute speed.

vLLM supports both strategies and can be configured to choose automatically based on context length.
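The tradeoff can be sketched with this section's own numbers. All constants below are rough assumptions (~320 KB of FP16 KV cache per token for Llama 3 70B, ~25 GB/s effective PCIe Gen4, prefill at 25-50K tokens/s), so treat this as a template to fill with measured values:

```python
# Back-of-envelope preemption costs: swap round trip vs. recompute.

KV_BYTES_PER_TOKEN = 320 * 1024          # Llama 3 70B, FP16 KV cache (assumed)
PCIE_BYTES_PER_S = 25e9                  # effective unidirectional PCIe Gen4
PREFILL_TOK_PER_S = (25_000, 50_000)     # assumed prefill throughput range

def swap_round_trip_ms(tokens):
    """Evict the KV cache to CPU DRAM now, copy it back on resume."""
    return 2 * tokens * KV_BYTES_PER_TOKEN / PCIE_BYTES_PER_S * 1000

def recompute_ms(tokens, tok_per_s):
    """Re-run prefill over the prompt plus tokens generated so far."""
    return tokens / tok_per_s * 1000

for ctx in (2_048, 4_096, 8_192, 32_768):
    fast = recompute_ms(ctx, PREFILL_TOK_PER_S[1])
    slow = recompute_ms(ctx, PREFILL_TOK_PER_S[0])
    print(f"{ctx:>6} tok: swap {swap_round_trip_ms(ctx):5.0f} ms round trip, "
          f"recompute {fast:4.0f}-{slow:4.0f} ms")
```

Note that this model treats prefill as linear in context length; in reality the attention term grows quadratically, which is what tips long contexts further toward swapping. The crossover is sensitive to every constant here.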

Request Scheduling: FCFS vs. Shortest-Job-First

[Interactive figure: timeline of 5 requests (generation lengths 12, 3, 5, 2, and 7 steps; dashed = waiting, solid = processing) scheduled under FCFS (avg TTFT: 11.6 steps, makespan: 29) and SJF (avg TTFT: 10.8 steps, makespan: 29).]

Key insight: SJF reduces average waiting time (TTFT), but requires knowing output length in advance -- which LLMs cannot provide. In practice, systems use preemption: start processing in FCFS order, but swap out long requests if shorter ones are waiting.
Figure 4: Request scheduling under different policies. Notice how FCFS leads to head-of-line blocking (short requests wait for long ones), while preemptive scheduling can evict and swap requests to maintain latency targets.

5.4 Scheduling for SLO Compliance

In production, serving systems have SLOs: "99th percentile TTFT < 500ms" or "TPOT < 50ms". The scheduler's job is not just to maximize throughput -- it is to maximize throughput within SLO constraints (this is what the DistServe paper calls "goodput").

Mooncake (Qin et al., FAST 2025) uses prediction-based early rejection to maintain SLO compliance (detailed in Section 9.6).

Sarathi-Serve uses stall-free scheduling that prevents prefill from blocking decode, as we discussed in Section 3.5. By guaranteeing that decode requests always make progress, it can provide TPOT guarantees even under heavy prefill load.


Part III: Scaling and Optimization

6. Parallelism Strategies

6.1 Why Parallelism is Necessary

A 70B model at FP16 = 140 GB. One H100 = 80 GB HBM. The model does not fit.

Even with INT4 quantization (35 GB), you still need room for KV caches, activations, and the CUDA runtime. Multi-GPU serving is almost always required for models above ~13B parameters if you want reasonable throughput.

6.2 Tensor Parallelism (TP)

Tensor parallelism splits each layer's weight matrices across GPUs. Each GPU holds a "slice" of every layer and processes its slice of every token.

The technique was formalized by Megatron-LM (Shoeybi et al., 2020). Here is how it works for the attention and MLP layers:

MLP parallelism: For a two-layer MLP ($Y = \text{GeLU}(XA) \cdot B$), matrix $A$ is split column-wise across GPUs. Each GPU computes its portion of the hidden dimension independently. Then matrix $B$ is split row-wise, and each GPU computes its portion of the output. A single all-reduce (a distributed operation where every GPU sums its partial results so all end up with the complete output) after the second matmul synchronizes the results across GPUs.

TP=4 MLP example:
GPU 0: X @ A[:, 0:H/4] → GeLU → @ B[0:H/4, :] → partial_output_0
GPU 1: X @ A[:, H/4:H/2] → GeLU → @ B[H/4:H/2, :] → partial_output_1
GPU 2: X @ A[:, H/2:3H/4] → GeLU → @ B[H/2:3H/4, :] → partial_output_2
GPU 3: X @ A[:, 3H/4:H] → GeLU → @ B[3H/4:H, :] → partial_output_3
                                                    ↓
                                               all-reduce (sum)
                                                    ↓
                                                final output
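The split above can be checked numerically, with numpy arrays standing in for the four GPUs and a plain sum standing in for the all-reduce (toy sizes, tanh-approximate GeLU):

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, H, TP = 2, 16, 32, 4                 # batch, model dim, hidden dim, "GPUs"
X = rng.standard_normal((B, D))
A = rng.standard_normal((D, H))            # first MLP matrix: split by columns
Bw = rng.standard_normal((H, D))           # second MLP matrix: split by rows

gelu = lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

reference = gelu(X @ A) @ Bw               # unsharded computation

shard = H // TP
partials = []
for g in range(TP):                        # each iteration = one GPU's work
    cols = slice(g * shard, (g + 1) * shard)
    h = gelu(X @ A[:, cols])               # column slice of A (GeLU is elementwise,
    partials.append(h @ Bw[cols, :])       # so it shards cleanly); row slice of B
output = sum(partials)                     # the all-reduce (sum)

print(np.allclose(output, reference))      # True
```

The column-then-row split is what makes one all-reduce per MLP sufficient: the GeLU between the two matmuls is elementwise, so no communication is needed mid-MLP.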

Attention parallelism: The attention heads are split across GPUs. With 64 heads and TP=8, each GPU handles 8 heads. The QKV projections and output projection are partitioned accordingly. Each GPU computes attention for its heads independently, then an all-reduce combines the results.

The communication cost is 2 all-reduces per layer (one for attention, one for MLP). For a 70B model with 80 layers, that is 160 all-reduces per forward pass. On NVLink (NVIDIA's high-speed GPU-to-GPU interconnect, 900 GB/s bidirectional on H100), each all-reduce for a small activation tensor takes about 10-50 microseconds. The total communication overhead is a few milliseconds -- acceptable within a node but prohibitive across nodes where bandwidth drops to 50-100 GB/s.

  • (+) Reduces latency: all GPUs work on every token in parallel
  • (+) The most common approach for serving within a single node
  • (+) Memory perfectly balanced: each GPU holds 1/TP of the weights and KV cache
  • (-) Requires fast interconnect (NVLink at 900 GB/s) because GPUs communicate after every layer
  • Typical values: TP=2, 4, or 8 within a single node

6.3 Pipeline Parallelism (PP)

Pipeline parallelism splits layers across GPUs: GPU 0 gets layers 0-19, GPU 1 gets layers 20-39, and so on. Each GPU processes its layers and passes the result (the hidden state -- a relatively small tensor) to the next GPU.

This was pioneered by GPipe (Huang et al., NeurIPS 2019), which introduced micro-batch pipelining: instead of processing one full batch sequentially through the pipeline (which leaves most GPUs idle), GPipe splits the batch into micro-batches that flow through the pipeline concurrently.

Naive PP (one micro-batch):
GPU 0: [████████]  [idle]      [idle]      [idle]
GPU 1: [idle]      [████████]  [idle]      [idle]
GPU 2: [idle]      [idle]      [████████]  [idle]
GPU 3: [idle]      [idle]      [idle]      [████████]
                                              ↑ 75% idle!

Micro-batch PP (4 micro-batches):
GPU 0: [mb1][mb2][mb3][mb4]
GPU 1:      [mb1][mb2][mb3][mb4]
GPU 2:           [mb1][mb2][mb3][mb4]
GPU 3:                [mb1][mb2][mb3][mb4]
                                        ↑ much less idle time

The communication is minimal: only the activation tensor (batch_size x hidden_dim, typically a few MB) needs to pass between adjacent stages. This makes PP work well over slower inter-node interconnects like InfiniBand (a high-bandwidth data center networking fabric, 100-400 Gb/s).

Megatron-LM v2 (Narayanan et al., SC 2021) showed how to combine TP within a node and PP across nodes, achieving efficient scaling to thousands of GPUs for both training and inference.

  • (+) Requires less communication (only between adjacent stages, not all-reduce)
  • (+) Works across nodes with slower interconnects (InfiniBand at 100-400 Gb/s)
  • (+) Each GPU only stores a fraction of the layers (good for very large models)
  • (-) Pipeline bubbles: GPU 0 is idle while GPU 3 processes (reduced but not eliminated by micro-batching)
  • (-) Higher latency for individual requests (must pass through all stages sequentially)
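The bubble has a simple closed form: with $p$ stages and $m$ micro-batches, each stage sits idle for $p-1$ of the $m+p-1$ pipeline slots:

```python
# GPipe-style pipeline bubble: idle fraction = (p - 1) / (m + p - 1).

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:2} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

With one micro-batch the 4-stage pipeline is 75% idle (matching the naive diagram above); with 16 micro-batches the bubble shrinks to about 16%. More micro-batches always help, at the cost of smaller per-step batches.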

6.4 Sequence Parallelism (SP)

When context lengths exceed what fits in a single GPU's KV cache memory, you can split the sequence across GPUs. Each GPU stores a portion of the KV cache and processes attention for its segment.

The key mechanism is ring attention: GPUs are arranged in a logical ring. Each GPU computes attention for its local KV cache segment, then passes its K/V blocks to the next GPU in the ring. After a full rotation, every GPU has attended to the complete sequence. The communication pattern is elegant -- only nearest-neighbor communication is needed, and it can overlap with computation.

Note: the term "sequence parallelism" is sometimes used in the Megatron-LM literature to refer specifically to partitioning LayerNorm and Dropout along the sequence dimension. Here we use it more broadly to describe any technique that splits the sequence across GPUs, including context parallelism and ring attention (Liu et al., 2023).

Sequence parallelism is primarily useful for very long contexts (100K+ tokens) where the KV cache for a single request does not fit in one GPU's memory, even after the model weights are distributed via tensor parallelism. For example, Llama 3 70B at 128K context requires 40 GB of KV cache per request in FP16 -- with TP=4, each GPU holds 10 GB of KV cache for just one request, leaving little room for batching.
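The 40 GB figure follows from Llama 3 70B's architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
# KV cache size per token and per request for Llama 3 70B in FP16.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # 2x for storing both K and V at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token / 1024, "KiB per token")                           # 320.0
print(round(per_token * 128_000 / 1e9, 1), "GB at 128K context")   # 41.9
```

At 320 KiB per token, a single 128K-context request holds roughly 42 GB of KV cache -- the "40 GB" quoted above.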

6.5 Expert Parallelism (EP) for MoE Models

Mixture-of-Experts (MoE) models like DeepSeek V3 (671B total parameters, 256 experts, 37B active per token) need special treatment. The model has 256 expert MLPs, but each token only activates 8 of them. Storing all 256 experts on every GPU wastes memory -- most experts sit idle for any given token.

Expert parallelism assigns each GPU a subset of experts. When a token is routed to an expert on another GPU, the token embedding is sent to that GPU via all-to-all communication (each GPU sends data to every other GPU). The expert processes the token and sends the result back.

The communication pattern is expensive -- all-to-all requires every GPU to send data to every other GPU, which can overwhelm network bandwidth. This is why EP is typically used across nodes (where you have fast InfiniBand) while TP handles the dense layers within a node.

DeepSeek V3 uses a combination: TP=4 within each 8-GPU node for the attention layers, and EP across nodes for the expert layers. The routing decisions (which tokens go to which experts) are made by a learned gating network that runs on all GPUs.

6.6 Choosing a Parallelism Strategy

| Scenario | Recommendation | Why |
|---|---|---|
| Model fits in 1 GPU | No parallelism needed | Simplest |
| Model fits in 1 node (2-8 GPUs) | TP within node | Low latency, NVLink available |
| Model spans multiple nodes | TP within node + PP across nodes | PP tolerates slower inter-node links |
| 100K+ context on 70B model | TP + SP | KV cache too large for one GPU |
| MoE model (DeepSeek V3, Mixtral) | TP + EP | Experts need dedicated GPUs |
| Disaggregated serving | Different strategies per phase | See Section 9 |

In my opinion, the DistServe insight -- that prefill and decode benefit from different parallelism strategies -- is one of those ideas that seems obvious in hindsight but fundamentally changes how you design serving systems. Prefill wants high TP (reduce latency of one big computation). Decode wants lower TP or PP (maximize throughput of many small computations).

Parallelism Strategies Compared

[Interactive figure: the tensor-parallelism panel shows 4 GPUs, each holding one column slice of all 80 layers, with an all-reduce after every layer. Advantages: lowest latency (all GPUs work on every token), simple load balancing. Disadvantages: requires a fast interconnect (NVLink), scales poorly beyond 8 GPUs. Best for co-located GPUs on the same node and latency-sensitive prefill.]
Figure 5: Parallelism strategies compared. Tensor parallelism splits each layer across GPUs (requiring all-reduce after every layer). Pipeline parallelism assigns different layers to different GPUs (requiring communication only between stages). The optimal strategy depends on your model size, hardware topology, and serving requirements.


Parallelism spreads the model across GPUs. Quantization shrinks the model so you need fewer GPUs -- or can fit more requests on the same ones.

7. Quantization for Serving

7.1 Why Quantization Matters for Serving

Quantization reduces numerical precision (e.g., FP16 to INT4 or FP8). For serving, the benefits hit at every level:

  1. Smaller model: Fewer GPUs needed. A 70B model at FP16 (140 GB) becomes 35 GB at INT4 -- fits on one H100.
  2. Smaller KV cache: FP8 KV cache is half the size of FP16. Double your concurrent requests.
  3. Faster weight loading: During decode (memory-bound), smaller weights transfer faster. 2x smaller = 2x faster.
  4. Hardware acceleration: H100 FP8 Tensor Cores run at 2x the throughput of FP16 Tensor Cores.
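The arithmetic behind points 1-3 is one line:

```python
# Weight-memory footprint of a 70B-parameter model at each precision.

def weight_gb(params_billions: float, bits: int) -> float:
    # 1B params at 8 bits = 1 GB, so GB = params_B * bits / 8
    return params_billions * bits / 8

for name, bits in [("FP16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    gb = weight_gb(70, bits)
    note = "fits on one 80 GB H100" if gb <= 80 else "needs multiple GPUs"
    print(f"{name:9} {gb:5.0f} GB  ({note})")
```

Note that "fits" here means the weights alone; you still need headroom for KV cache, activations, and the CUDA runtime, as discussed in Section 6.1.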

7.2 Weight-Only Quantization

Keep activations in FP16, quantize only the weights to INT4 or INT8. During matrix multiplication, weights are dequantized on the fly.

Why it works during decode: Decode is memory-bound -- the bottleneck is loading weights from HBM. Smaller weights transfer faster. The dequantization cost is minimal compared to the memory bandwidth savings.

Why it's less effective during prefill: Prefill is compute-bound -- you're already saturating the compute units. Smaller weights don't help because memory bandwidth isn't the bottleneck, and dequantization adds overhead.

GPTQ (Frantar et al., ICLR 2023): Uses Hessian-based optimization to quantize weights layer by layer. The first method to push 4-bit quantization on 175B parameter models with negligible quality loss. Achieves 3.25x speedup on A100.

AWQ (Lin et al., MLSys 2024 Best Paper): Observes that not all weights are equally important -- protecting 1% of salient channels (identified by activation magnitudes) greatly reduces quantization error. More robust than GPTQ across models and tasks. Comparable or faster inference speed than GPTQ with better quality retention.

GGUF/llama.cpp: Per-tensor quantization in a variety of formats (Q4_K_M, Q5_K_S, etc.) optimized for CPU inference. The standard for running models locally on laptops and desktops.

7.3 Weight-and-Activation Quantization (W8A8)

Quantizing both weights and activations enables INT8 or FP8 Tensor Core operations -- actual compute speedup, not just memory savings.

SmoothQuant (Xiao et al., ICML 2023): The challenge is that activations have outliers ~100x larger than typical values. If the largest activation value is 100 and the smallest is 0.01, INT8 quantization must map the range [0.01, 100] into 256 levels -- wasting most of the precision on the outlier range while values near 0.01 get crushed to zero.

SmoothQuant's solution: divide each activation channel by a smoothing factor $s$, and multiply the corresponding weight channel by $s$. This is mathematically equivalent (the factors cancel in the matrix multiply), but the activation range shrinks while the weight range expands:

$$Y = (X \,\text{diag}(s)^{-1}) \cdot (\text{diag}(s)\, W) = \hat{X} \hat{W}$$

The smoothed activations $\hat{X}$ have smaller outliers (easier to quantize). The scaled weights $\hat{W}$ have a larger range (still easy to quantize, since weights are already well-behaved). Result: both fit nicely into INT8. Achieves 1.56x speedup and 2x memory reduction.
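The cancellation is easy to verify numerically. This toy example uses one synthetic outlier channel and a simplified smoothing factor (not SmoothQuant's exact alpha-balanced formula):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))
X[:, 3] *= 100                      # one outlier channel, as in the text
W = rng.standard_normal((8, 16))

s = np.abs(X).max(axis=0) ** 0.5    # per-channel smoothing factor (one choice)
X_hat = X / s                       # smoothed activations
W_hat = W * s[:, None]              # correspondingly scaled weight rows

print(np.allclose(X @ W, X_hat @ W_hat))        # True: the product is unchanged
print(np.abs(X).max() / np.abs(X_hat).max())    # > 1: outlier range shrank
```

The matrix product is bit-for-bit equivalent up to floating-point tolerance, while the activation outliers shrink -- exactly the property that makes INT8 quantization of $\hat{X}$ viable.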

FP8 (W8A8): The H100's native FP8 support -- E4M3 format (4 exponent bits, 3 mantissa bits -- better precision for inference) and E5M2 format (5 exponent bits, 2 mantissa bits -- wider range for gradients) -- has become the default for production serving on Hopper GPUs. H100's native FP8 support provides 2x the Tensor Core throughput of FP16, translating to 30-50% serving throughput improvement with 99-100% quality retention on most tasks. This is the sweet spot in 2026: meaningful speedup with negligible quality loss.

7.4 KV Cache Quantization

You can quantize the KV cache independently of the model weights. This directly addresses the memory bottleneck:

  • FP8 KV cache = 2x more concurrent requests (or 2x longer contexts) at the same memory budget
  • vLLM supports this natively: --kv-cache-dtype fp8
  • Quality impact is minimal at 8-bit; degrades noticeably at 4-bit for complex reasoning tasks

7.5 The Accuracy-Throughput Frontier

| Method | Precision | Quality Retention | Throughput Gain | Hardware |
|---|---|---|---|---|
| FP16 baseline | W16A16 | 100% | 1x | Any GPU |
| FP8 | W8A8 | 99-100% | 1.5-2x | H100+ |
| SmoothQuant | W8A8 (INT) | 98-99% | 1.5x | A100+ |
| AWQ | W4A16 | 97-99% | 2-3x | Any GPU |
| GPTQ | W4A16 | 96-99% | 2-3x | Any GPU |
| GGUF Q4_K_M | W4A16 | 95-98% | 2-3x (CPU) | CPU |

My recommendation: FP8 on H100 is the default choice for production. If you need maximum memory efficiency, combine INT4 weight quantization (AWQ) with FP8 KV cache. If you're running on consumer hardware, GGUF formats with llama.cpp.

Here is what this looks like in practice with vLLM:

# FP8 quantization on H100 -- the sweet spot for production
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70B-Instruct \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --enable-prefix-caching

This single command gives you FP8 model weights, FP8 KV cache (2x more requests), tensor parallelism across 4 GPUs, and automatic prefix caching. Before frameworks like vLLM and SGLang existed, achieving this combination -- FP8 quantization, prefix caching, tensor parallelism -- required building custom CUDA kernels. Today, it is a single CLI command.

Quantization Methods: Quality vs. Throughput vs. Memory

| Method | Precision | Quality | Throughput | Memory | Hardware |
|---|---|---|---|---|---|
| FP16 (baseline) | W16A16 | 100% | 1.0x | 100% | Any GPU |
| FP8 (H100) | W8A8 | 99.7% | 1.8x | 50% | H100/H200 (FP8) |
| SmoothQuant | W8A8 | 99.5% | 1.5x | 50% | INT8 Tensor Cores |
| GGUF Q8_0 | W8A16 | 99% | 0.6x | 50% | CPU/GPU |
| AWQ | W4A16 | 98.5% | 1.4x | 27% | Any GPU |
| GPTQ | W4A16 | 97.5% | 1.3x | 27% | Any GPU |
| INT4 + FP8 KV | W4A16 + FP8 KV | 97% | 1.5x | 25% | H100 (FP8 KV) |
| GGUF Q4_K_M | W4A16 | 96% | 0.8x | 27% | CPU/GPU |

Figure 6: Quantization methods compared on the accuracy-throughput frontier. FP8 on H100 achieves the best ratio of throughput gain to quality loss. Weight-only INT4 methods (AWQ, GPTQ) are best when memory is the primary constraint.


8. Speculative Decoding

8.1 The Core Observation

Here is a surprising fact about LLM decode: generating 1 token costs almost the same as verifying 5 tokens in parallel.

Why? The bottleneck during decode is loading model weights from HBM. Whether you multiply those weights by 1 token or 5 tokens, the data transfer cost is nearly identical -- you are reading the same weight matrices either way. The additional compute for 5 tokens is negligible because the GPU is memory-bound.

This means: if we can guess the next K tokens cheaply, we can verify all K in one forward pass of the target model, potentially accepting most or all of them. The expected number of accepted tokens depends on both the draft length K and the acceptance rate alpha, following a geometric series (we will derive this shortly).

I suspect speculative decoding is still underutilized in production because teams struggle with the operational complexity of maintaining two models. The self-speculative methods I'll describe below are changing this.

8.2 Draft-Then-Verify

Speculative decoding (Leviathan et al., ICML 2023; independently, Chen et al., 2023) works in three phases:

  1. Draft: A small, fast "draft model" generates K candidate tokens autoregressively.
  2. Verify: The full target model processes all K tokens in one forward pass, computing the probability of each.
  3. Accept/reject: For each position, use rejection sampling to decide whether to accept the draft token or resample from the target distribution.
def speculative_decode(target_model, draft_model, prompt, K=5):
    """Generate tokens with speculative decoding (pseudocode)."""
    tokens = list(prompt)

    while tokens[-1] != EOS:
        # Phase 1: Draft K tokens cheaply
        draft_tokens = []
        draft_probs = []  # full draft distribution at each drafted position
        for _ in range(K):
            p_draft = draft_model.get_probs(tokens + draft_tokens)
            token = sample(p_draft)
            draft_tokens.append(token)
            draft_probs.append(p_draft)

        # Phase 2: Verify all K tokens in ONE forward pass.
        # Returns K+1 distributions: one at each drafted position,
        # plus one past the last draft token (for the bonus token).
        target_probs = target_model.get_probs_batch(tokens, draft_tokens)

        # Phase 3: Accept or reject each token
        accepted = []
        for i, token in enumerate(draft_tokens):
            p_target = target_probs[i][token]
            p_draft_token = draft_probs[i][token]

            # Accept with probability min(1, p_target / p_draft_token)
            if random() < min(1.0, p_target / p_draft_token):
                accepted.append(token)
            else:
                # Reject: sample from the residual distribution
                # (elementwise max over the vocabulary, then renormalize)
                residual = max(0, target_probs[i] - draft_probs[i])
                residual = residual / residual.sum()
                accepted.append(sample(residual))
                break  # Stop accepting after first rejection
        else:
            # All K tokens accepted -- sample bonus token from target
            bonus = sample(target_probs[K])
            accepted.append(bonus)

        tokens.extend(accepted)

    return tokens

8.3 Why It Works: The Rejection Sampling Guarantee

This is the key insight that makes speculative decoding special: the output distribution is identical to standard autoregressive sampling from the target model. It is not an approximation. The name "speculative decoding" is deliberate -- it borrows from speculative execution in CPU design, where the processor guesses ahead and rolls back if the guess is wrong.

Here is the intuition first, then the math.

Intuition: For each draft token, we ask: "would the target model have produced this token?" If the target assigns equal or higher probability than the draft model, we always accept -- the draft model guessed something the target model likes. If the target assigns lower probability, we accept with reduced probability proportional to the target's preference. If we reject, we sample from the "leftover" distribution -- the tokens the target model likes but the draft model did not favor enough.

The math: For each position, let $q(x)$ be the draft model's probability and $p(x)$ be the target model's probability:

  • Accept with probability $\min\left(1, \frac{p(x)}{q(x)}\right)$
  • If rejected, sample from the residual distribution: $p'(x) = \frac{\max(0,\, p(x) - q(x))}{\sum_x \max(0,\, p(x) - q(x))}$

Why this produces samples from exactly $p(x)$: Consider two cases for a specific token $x$:

Case 1: $p(x) \geq q(x)$ (target likes this token at least as much as the draft). The acceptance probability is 1 -- always accept. The contribution to the output distribution from this path is $q(x) \cdot 1 = q(x)$. The remaining $p(x) - q(x)$ comes from the rejection path: the probability of rejection is $\beta = \sum_x \max(0,\, q(x) - p(x))$, and the residual distribution contributes $\beta \cdot \frac{p(x) - q(x)}{\sum_x \max(0,\, p(x) - q(x))} = p(x) - q(x)$ (the two sums are equal by construction). So the total is $q(x) + (p(x) - q(x)) = p(x)$.

Case 2: $p(x) < q(x)$ (draft overestimates this token). Accept with probability $p(x)/q(x)$, so the accepted contribution is $q(x) \cdot p(x)/q(x) = p(x)$. The residual distribution assigns this token zero probability (since $\max(0,\, p(x) - q(x)) = 0$), so there is no rejection-path contribution and the total is exactly $p(x)$. In general:

$$P(\text{output} = x) = q(x) \cdot \min\left(1, \frac{p(x)}{q(x)}\right) + P(\text{reject}) \cdot p'(x) = p(x)$$

The better the draft model matches the target, the higher the acceptance rate. A perfect draft model ($q = p$) would accept every token, yielding $K+1$ tokens per target forward pass (the $K$ drafts plus the bonus token). In practice, acceptance rates of 70-85% are typical with a well-matched draft model (e.g., Llama 3 8B drafting for Llama 3 70B).

Expected tokens per step: If the acceptance rate is $\alpha$ and we draft $K$ tokens, the expected number of accepted tokens is:

$$E[\text{accepted}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

For $\alpha = 0.8$ and $K = 5$: $E[\text{accepted}] \approx 3.7$ tokens per step. Since the verification step costs roughly the same as one standard decode step (both load the full model weights), this is a ~3.7x speedup.

This follows from the geometric series: the probability that the first $i$ draft tokens are all accepted is $\alpha^i$, so $E[\text{accepted}] = \sum_{i=0}^{K} \alpha^i = \frac{1 - \alpha^{K+1}}{1 - \alpha}$.
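Plugging the formula in directly:

```python
# Expected accepted tokens per verification step: (1 - a^(K+1)) / (1 - a).

def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    row = ", ".join(f"K={k}: {expected_tokens(alpha, k):.2f}" for k in (3, 5, 8))
    print(f"alpha={alpha}: {row}")
```

Note the diminishing returns in $K$: at $\alpha = 0.8$, going from $K=5$ to $K=8$ adds less than one expected token per step, while each extra draft token costs another draft-model forward pass.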

Speculative Decoding: Step by Step

Step 1: Draft Model Generates Candidates

The small draft model generates K=4 candidate tokens autoregressively.

| Position | Token 1 | Token 2 | Token 3 | Token 4 |
|---|---|---|---|---|
| Draft token | The | quick | brown | fox |
| p_draft | 0.85 | 0.72 | 0.65 | 0.90 |
Figure 7: Speculative decoding step-by-step. The draft model generates K candidate tokens quickly, the target model verifies them all in one forward pass, and rejection sampling ensures the output matches the target distribution exactly.

8.4 Draft Model Selection

The classic approach: use a smaller model from the same family. For Llama 3 70B as the target, use Llama 3 8B as the draft. The draft model should be 5-10x smaller than the target -- small enough to generate quickly, large enough to match the target's distribution reasonably well.

The acceptance rate depends heavily on how well the draft model matches the target. Some rules of thumb from practice:

  • Same family, smaller size (Llama 3 8B for Llama 3 70B): 70-85% acceptance rate. Best general approach.
  • Different family, similar training data: 50-70% acceptance rate. Works but less efficient.
  • Quantized version of target (INT4 of target as draft): 75-90% acceptance rate. High match but draft is still slow (same architecture, just lighter).

The challenge: maintaining two models increases memory requirements. Llama 3 8B at FP16 is 16 GB -- on top of the 70B model's 140 GB (or more, after KV caches). On 8 H100s (640 GB), the draft model adds 2.5% memory overhead, which is manageable. But on smaller setups, the draft model's memory can be significant.

Operational complexity is the bigger concern: both models must use the same tokenizer, both need to be updated when you fine-tune or upgrade, and the serving system must orchestrate the draft-verify-accept loop.

8.5 Self-Speculative Decoding (No Separate Draft Model)

Medusa (Cai et al., ICML 2024): Add K extra MLP "heads" to the target model. Each head predicts a different future token from the same hidden state. Tree-structured verification allows checking multiple candidate continuations simultaneously. Achieves 2.2-3.6x speedup.

  • (+) No separate draft model. One model to deploy, manage, and update.
  • (-) Requires training the extra heads (Medusa-1: heads only; Medusa-2: heads + backbone fine-tune).

EAGLE (Li et al., ICML 2024): Autoregressively predicts at the feature level (second-to-last layer features) rather than the token level, where autoregression is more regular and predictable. EAGLE-2 (EMNLP 2024) added dynamic draft trees. The EAGLE family achieves 3.0-6.5x speedup (EAGLE-2: 3.0-4.3x, EAGLE-3: 4-6x) -- currently one of the fastest speculative decoding methods available.

For a comprehensive taxonomy of speculative decoding approaches, see the survey by Xia et al., ACL 2024.

| Method | Requires Draft Model? | Speedup | Training Needed | Memory Overhead |
|---|---|---|---|---|
| Classic Speculative Decoding | Yes (separate model) | 2-3x | No | High (2 models) |
| Medusa | No (extra heads) | 2.2-3.6x | Yes (heads) | Low (~1B params) |
| EAGLE | No (feature predictor) | 2-3x | Yes (predictor) | Low (~1B params) |
| EAGLE-2/3 | No (dynamic trees) | 3.0-6.5x | Yes (predictor) | Low (~1B params) |

Part IV: Advanced Architecture

9. Prefill-Decode Disaggregation

9.1 The Interference Problem

Sections 3-8 presented techniques that optimize serving on a fixed set of GPUs -- better batching, better memory management, better scheduling. But they all share one assumption: prefill and decode run on the same GPUs. This creates a fundamental conflict:

  • A long prefill stalls all ongoing decodes (increases TPOT, as we saw in Section 3.4)
  • Decode requests consume memory and batch slots that could be used for new prefills (increases TTFT for waiting requests)
  • Prefill wants high tensor parallelism and small batch size; decode wants low parallelism and large batch size

You cannot optimize for both TTFT and TPOT simultaneously on shared hardware. This is the motivation for disaggregation: run prefill and decode on separate GPU pools.

9.2 The Disaggregation Idea

Assign prefill to one set of GPUs and decode to another. After prefill completes, transfer the KV cache to a decode GPU.

Each phase gets its own optimal configuration:

  • Prefill GPUs: High TP (4-8) for low TTFT. Process prompts as fast as possible.
  • Decode GPUs: Optimized for memory bandwidth. Pack as many concurrent requests as possible per GPU.

The cost: you must transfer the KV cache from prefill to decode GPUs. For Llama 3 70B at 4K context, that is ~1.25 GB per request. Over NVLink (900 GB/s), the transfer takes ~1.4ms. Over InfiniBand (400 Gb/s = 50 GB/s), about 25ms.
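These transfer times follow directly from the per-token KV cache size (~320 KB/token for Llama 3 70B in FP16, the same assumption behind the 1.25 GB figure above):

```python
# KV cache transfer time from prefill to decode GPUs at 4K context.

KV_BYTES_PER_TOKEN = 320 * 1024   # Llama 3 70B, FP16 (assumption from the text)

def transfer_ms(tokens, link_bytes_per_s):
    return tokens * KV_BYTES_PER_TOKEN / link_bytes_per_s * 1000

ctx = 4096
print(f"NVLink (900 GB/s):    {transfer_ms(ctx, 900e9):.1f} ms")   # ~1.5 ms
print(f"InfiniBand (50 GB/s): {transfer_ms(ctx, 50e9):.1f} ms")    # ~26.8 ms
```

The 18x gap between the two links is why KV-aware placement (putting communicating prefill and decode instances on fast interconnects) matters so much in disaggregated systems.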

9.3 Three System Comparison

Three major systems have taken different approaches to disaggregation:

| Feature | DistServe | Splitwise | Mooncake |
|---|---|---|---|
| Paper | Zhong et al., OSDI 2024 | Patel et al., ISCA 2024 | Qin et al., FAST 2025 |
| Core idea | Goodput-optimized disaggregation | Heterogeneous hardware per phase | KVCache-centric distributed architecture |
| Hardware | Homogeneous (same GPUs) | Heterogeneous (H100 prefill, A100 decode) | Homogeneous + CPU/SSD for KV storage |
| KV cache location | Transferred between GPU pools | Transferred between GPU pools | Distributed pool (GPU + CPU DRAM + SSD) |
| Scheduler | Phase-specific parallelism optimizer | Cost/throughput optimizer | Global "Conductor" with KV-aware routing |
| Key insight | Prefill and decode want different parallelism | Decode doesn't need expensive GPUs | KV cache is the central resource, not computation |
| Production use | Research system | Research system | Powers Kimi (Moonshot AI), thousands of nodes |
| Main result | 7.4x more requests within SLO | 1.4x throughput at 20% lower cost | 75% more requests under real workloads |

9.4 DistServe: Goodput-Optimized

DistServe (Zhong et al., OSDI 2024) formalizes goodput -- throughput within latency SLOs -- as the optimization target. The key insight is that prefill and decode not only have different compute profiles but also benefit from different parallelism strategies:

  • Prefill: TP=4 with 4 GPUs. Processing a 2K-token prompt in parallel across 4 GPUs minimizes TTFT. The all-reduce overhead (microseconds) is negligible compared to the compute time (tens of milliseconds).
  • Decode: TP=1 with pipeline parallelism or simply more requests per GPU. Since decode is memory-bound, distributing one request across 4 GPUs wastes 3 GPUs' compute. Better to put 4 independent requests on 4 separate GPUs.

DistServe also uses bandwidth-aware placement to minimize KV cache transfer costs. Prefill and decode instances that communicate frequently are placed on GPUs with fast interconnects.

Result: 7.4x more requests meeting SLOs, or equivalently, 12.6x tighter SLO targets compared to colocated systems. A 7.4x improvement in SLO compliance means you can serve the same workload on roughly 7x fewer GPUs -- or 7x more users on the same hardware.

9.5 Splitwise: Heterogeneous Hardware

Splitwise (Patel et al., ISCA 2024) asks a practical question: if decode does not need much compute, why pay for expensive compute GPUs?

Decode is memory-bandwidth-bound. An A100 has 2 TB/s bandwidth and 312 TFLOPS. An H100 has 3.35 TB/s bandwidth and 989 TFLOPS. For decode, the H100's 3x compute advantage is irrelevant -- bandwidth is only 1.7x better. But the H100 costs roughly 2x more.
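
The arithmetic can be made explicit. A quick sanity check using the numbers above (the relative costs are illustrative assumptions, not list prices):

```python
# Per-GPU specs from the text; rel_cost is an assumed relative price.
gpus = {
    "A100": {"bw_tb_s": 2.00, "tflops": 312, "rel_cost": 1.0},
    "H100": {"bw_tb_s": 3.35, "tflops": 989, "rel_cost": 2.0},
}
for name, g in gpus.items():
    # Decode is bandwidth-bound, so tokens/s per dollar tracks bandwidth/cost;
    # prefill is compute-bound, so it tracks TFLOPS/cost.
    print(f"{name}: decode value {g['bw_tb_s'] / g['rel_cost']:.2f} TB/s per $, "
          f"prefill value {g['tflops'] / g['rel_cost']:.1f} TFLOPS per $")
# A100 wins on decode bandwidth per dollar; H100 wins on prefill compute per dollar.
```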

Splitwise exploits this: assign H100s to prefill (where you need the 989 TFLOPS) and A100s to decode (where 2 TB/s bandwidth is sufficient). The paper explores several configurations:

  • Iso-performance: Match colocated throughput at 20% lower cost
  • Iso-cost: Match colocated cost with 1.4x higher throughput
  • Power-optimized: Match throughput at 10% lower power

This is a practical insight for anyone building a serving cluster. Not every GPU in your fleet needs to be the latest generation.

9.6 Mooncake: KVCache-Centric Architecture

Mooncake (Qin et al., FAST 2025 Best Paper), the production system behind Moonshot AI's Kimi chatbot, takes disaggregation furthest. Instead of treating KV cache as a side effect of computation, it treats KV cache as the first-class resource around which the entire system is designed.

The architecture has three major components:

  1. Prefill nodes: Process input prompts and generate KV caches. Optimized for compute.
  2. Decode nodes: Generate output tokens using pre-computed KV caches. Optimized for memory bandwidth.
  3. Distributed KV pool: Stores KV caches across GPU HBM, CPU DRAM, and SSD across the entire cluster. This is the key innovation -- the KV pool is a distributed storage system, not just a per-node cache.

A global scheduler called Conductor routes requests based on where their KV caches already reside. If a request's prefix is cached on node 5, Conductor routes it to node 5 to avoid a costly KV cache transfer. This ties directly into prefix caching (Section 10) -- Mooncake combines disaggregation and prefix caching into a unified system.

Mooncake handles overloaded scenarios with prediction-based early rejection: before admitting a request, the system estimates whether it can meet its SLO. If adding the request would degrade existing requests below their SLOs, the new request is rejected immediately. The rationale: it is better to reject one request with a clean error than to return slow responses for everyone.
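
Prediction-based early rejection can be sketched as an admission check. The predictor interface and the request/SLO fields here are assumptions for illustration, not Mooncake's actual API:

```python
def admit(new_req, running, predict_tpot_ms):
    """Admit `new_req` only if every request in the would-be batch is still
    predicted to meet its time-per-output-token (TPOT) SLO."""
    batch = running + [new_req]
    predicted = predict_tpot_ms(batch)   # e.g. fitted from batch size and context lengths
    if any(predicted[r["id"]] > r["slo_tpot_ms"] for r in batch):
        return False   # clean, immediate rejection beats slow answers for everyone
    return True

# Toy predictor: decode slows linearly with batch size (illustrative only).
predict = lambda batch: {r["id"]: 10 * len(batch) for r in batch}
reqs = [{"id": i, "slo_tpot_ms": 35} for i in range(4)]
admit(reqs[2], reqs[:2], predict)   # batch of 3 -> 30 ms each: admitted
admit(reqs[3], reqs[:3], predict)   # batch of 4 -> 40 ms each: rejected
```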

Under real production workloads serving Kimi on A800 clusters, Mooncake handles 75% more requests. In simulated scenarios with varying context lengths, throughput improvements range from 50% to 525% compared to vLLM baselines.

Colocated vs. Disaggregated Serving

Problem: Prefill and decode compete for the same GPU resources.

A long prefill stalls all ongoing decodes, spiking TPOT for every request in the batch.

Cannot independently optimize batch size or parallelism for each phase.

Figure 8: Colocated vs. disaggregated serving. In the colocated model (left), prefill and decode compete for the same GPU resources. In the disaggregated model (right), each phase gets its own GPU pool with tailored configurations. The KV cache transfer between phases is the key overhead.

9.7 The Cost of Disaggregation

Disaggregation is not free:

  • (-) KV cache transfer overhead (milliseconds to tens of milliseconds per request)
  • (-) Requires fast interconnect (NVLink, InfiniBand, or RDMA (Remote Direct Memory Access -- letting one machine read from another's memory directly, bypassing the CPU for low-latency transfers))
  • (-) Model weights must be duplicated on both prefill and decode clusters
  • (-) Significantly more engineering complexity
  • (+) Better resource utilization per dollar
  • (+) Independent scaling of prefill and decode capacity
  • (+) Eliminates prefill-decode interference

My sense is that disaggregation will be the default architecture for any large-scale deployment within the next year or two. The engineering overhead is real, but the efficiency gains are too large to ignore at scale.


10. Prefix Caching and Prompt Caching

10.1 The Redundancy Opportunity

Many LLM workloads have shared prefixes:

  • System prompts: Every request to a chatbot starts with the same 500-2000 token system prompt
  • Few-shot examples: Evaluation benchmarks prepend the same examples to every query
  • Multi-turn conversations: Each new message shares the full conversation history as a prefix
  • RAG pipelines: Requests about the same document share the retrieved context

If 100 requests share a 2000-token system prompt, a naive system computes the KV cache for those 2000 tokens 100 times. With prefix caching: compute it once, reuse it 99 times. That is 99% less prefill compute for the shared prefix.

10.2 Automatic Prefix Caching in vLLM

vLLM implements automatic prefix caching at the block level. Here is how it works:

  1. When prefill computes KV cache blocks, each full block of 16 tokens is hashed together with the hash of the blocks before it: hash(prefix_hash, token_ids[i:i+16]). Chaining the hashes this way means a hit on block i guarantees the entire prefix up to that block matches
  2. The hash and corresponding KV block are stored in a cache table
  3. When a new request arrives, its token blocks are hashed and checked against the table
  4. If a cache hit occurs, the cached KV block is reused -- that block's prefill computation is skipped entirely
  5. Only blocks that are cache misses need to be computed

This works transparently -- no code changes needed. Enable it with --enable-prefix-caching and vLLM handles the rest. The hash computation is negligible (microseconds) compared to the prefill compute it saves (milliseconds to seconds).
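
The mechanism can be sketched in a few lines. This is a toy model of the chained-hash lookup, not vLLM's implementation (real blocks hold actual KV tensors and physical block IDs):

```python
import hashlib

BLOCK = 16  # tokens per KV block (vLLM's default block size)

def block_hashes(token_ids):
    """Chain block hashes so that a hit on block i implies the whole
    prefix token_ids[0 : 16*(i+1)] matches."""
    hashes, prev = [], b"root"
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        prev = hashlib.sha256(prev + repr(token_ids[i:i + BLOCK]).encode()).digest()
        hashes.append(prev)
    return hashes

kv_table = {}  # hash -> cached KV block (placeholder objects here)

def prefill(token_ids):
    """Return (blocks reused, blocks computed) for one request."""
    hits = misses = 0
    for h in block_hashes(token_ids):
        if h in kv_table:
            hits += 1                # cache hit: skip this block's prefill
        else:
            kv_table[h] = object()   # stand-in for the freshly computed KV block
            misses += 1
    return hits, misses
```

Calling `prefill` twice on the same 64 tokens yields (0, 4) then (4, 0); a third request sharing only the first 32 tokens hits the first two blocks and recomputes the rest, (2, 2).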

Concrete savings: If your system prompt is 1000 tokens and you serve 100 requests per minute, without prefix caching you compute the system prompt 100 times. With prefix caching, you compute it once and reuse it 99 times. For a 70B model, that saves roughly 99 x 20ms = ~2 seconds of GPU time per minute. At scale (thousands of requests per minute), the savings are substantial.

Limitations: matching is at block boundaries (16-token granularity), and only exact prefix matches count. A one-token difference in the prefix results in a full miss for the differing block and all subsequent blocks. This means prefix caching works best when prefixes are identical byte-for-byte -- which they are for system prompts and few-shot examples, but not for paraphrased content.

10.3 RadixAttention (SGLang)

SGLang (Zheng et al., NeurIPS 2024) uses a radix tree (compressed prefix tree) to store cached KV caches. A radix tree is a trie where nodes with single children are merged, making it efficient for storing strings with shared prefixes.

Here is how RadixAttention differs from vLLM's block-level hashing:

vLLM approach: Hash each full 16-token block (chained with the hash of the blocks before it) and look the result up in a flat hash table. If found, reuse the cached KV block. This is fast but rigid -- matching only works at 16-token boundaries, and the flat table encodes no structural relationship between cached entries.

RadixAttention approach: Maintain a tree where each edge represents a sequence of tokens. The tree captures the hierarchical structure of cached prefixes. When a new request arrives, traverse the tree from the root, matching tokens until a mismatch. The matched portion is a cache hit; only the unmatched suffix needs computation.

Example tree structure:

Root
├── "You are a helpful assistant..." (system prompt A)
│   ├── "Translate this to French: ..." (request 1)
│   ├── "Summarize the following: ..." (request 2)
│   └── "What is the capital of ..." (request 3)
└── "You are a coding expert..." (system prompt B)
    ├── "Write a Python function ..." (request 4)
    └── "Debug this code: ..." (request 5)

All requests under system prompt A share the same KV cache for the system prompt. Requests 1, 2, and 3 only compute KV for their unique suffixes. If request 6 arrives with the same system prompt A and a new query, it finds the system prompt in the tree, reuses its KV cache, and only computes the query portion.
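
A minimal token-level trie captures the lookup. Real RadixAttention compresses single-child chains into radix-tree edges and attaches KV block handles and eviction metadata to nodes; this sketch tracks only prefix matching:

```python
class Node:
    def __init__(self):
        self.children = {}  # token -> Node; each node would own one token's KV entry

def insert(root, tokens):
    """Record a request's tokens so later requests can reuse the prefix."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, Node())

def match_prefix(root, tokens):
    """Length of the longest cached prefix of `tokens`."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, matched = node.children[t], matched + 1
    return matched

root = Node()
sys_a = "You are a helpful assistant .".split()
insert(root, sys_a + "Translate this to French :".split())
hit = match_prefix(root, sys_a + "Summarize the following :".split())
# hit == len(sys_a): only the new query suffix needs prefill
```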

RadixAttention also enables cache-aware scheduling: the scheduler reorders the request queue to prioritize requests whose prefixes have the highest cache hit rates. This maximizes reuse and reduces total compute.

SGLang achieves 50-99% cache hit rates depending on the workload. For multi-call workloads (agents that make multiple LLM calls with shared context, such as tree-of-thought reasoning or multi-step tool use), RadixAttention delivers up to 5x throughput improvement. The gains are proportional to how much prefix sharing exists in the workload.

Prefix Caching with Radix Tree


Figure 9: Prefix caching with a radix tree. Requests sharing common prefixes (like system prompts) reuse cached KV blocks. Green nodes represent cache hits (skipped computation); red nodes represent cache misses (computed fresh).

10.4 When Prefix Caching Helps Most

| Workload | Cache Hit Rate | Benefit |
|---|---|---|
| Chatbot with system prompt | 90-99% | High (same prefix for every request) |
| Multi-turn conversation | 70-95% | High (previous turns are shared) |
| Agents / multi-step reasoning | 80-95% | Very high (same context, multiple calls) |
| RAG with few documents | 50-80% | Moderate (shared document context) |
| Diverse unique prompts | Under 10% | Minimal |

The overhead of prefix caching is negligible (small hash table or tree lookup). There is no reason not to enable it -- it either helps or does nothing.


Part V: Practice and Future

With the techniques covered, the practical question becomes: which framework implements them best for your workload?

11. Serving Frameworks Comparison

11.1 The Field in 2026

The serving framework ecosystem has matured rapidly. In 2022, you had two choices: HuggingFace Transformers (easy but slow) or FasterTransformer (fast but hard to use). Today, there are at least six production-quality options, and all of them support continuous batching, PagedAttention (or equivalent), quantization, and multi-GPU parallelism.

The differences are in the details: which workloads each framework optimizes for, how extensible they are, and which hardware they support.

11.2 vLLM

Core innovation: PagedAttention. The first open-source engine to make KV cache memory management a first-class concern.

  • (+) Highest throughput at high concurrency
  • (+) Broadest model support (most architectures on HuggingFace Hub)
  • (+) OpenAI-compatible API server (drop-in replacement)
  • (+) Active community (largest serving framework by GitHub stars)
  • (+) Multi-LoRA serving (serving multiple fine-tuned LoRA adapters from a single base model): shared weights across adapters, switching adapters per request with minimal overhead
  • (-) Historically higher TTFT than SGLang (gap has narrowed)
  • (-) Complex C++/CUDA core harder to extend than pure Python

Best for: High-concurrency production deployments, teams migrating from OpenAI API.

11.3 SGLang

Core innovation: RadixAttention for prefix caching, structured generation primitives.

  • (+) Best for agentic and multi-call workloads (RadixAttention shines here)
  • (+) Excellent prefix caching with cache-aware scheduling
  • (+) Clean Python codebase (scheduler is under 4K lines)
  • (+) Competitive or superior throughput on most benchmarks
  • (+) Structured/constrained decoding: compressed finite state machines for JSON schema enforcement, regex-guided generation, and grammar-constrained output -- up to 5x faster than naive constrained decoding
  • (-) Smaller community than vLLM
  • (-) More recent, less proven at extreme scale (thousands of GPUs)

Best for: Agents, RAG pipelines, structured output, prefix-heavy workloads.

11.4 TensorRT-LLM

Core innovation: Fused CUDA kernel compilation per model and input shape.

  • (+) Maximum single-request performance on NVIDIA hardware
  • (+) Deepest NVIDIA hardware optimization (FP8 Tensor Engine, custom attention kernels)
  • (+) Speculative decoding support with up to 3.6x throughput gains
  • (-) NVIDIA GPUs only
  • (-) Complex build process (model-specific compilation required)
  • (-) Harder to customize than Python-first frameworks

Best for: Latency-sensitive applications, maximum hardware utilization on NVIDIA GPUs.

11.5 llama.cpp / GGUF

Core innovation: CPU-first inference with aggressive quantization.

  • (+) Runs on commodity hardware (laptops, phones) with no GPU
  • (+) Best quantization ecosystem (1.5-bit to 8-bit with many format options)
  • (+) Pure C/C++ with no dependencies
  • (+) GPU offloading support (partial or full model on GPU)
  • (-) Not designed for high-throughput serving (limited batching support)
  • (-) No PagedAttention; limited continuous batching (server mode only)

Best for: Local/edge deployment, prototyping, privacy-sensitive applications, running models on hardware you already own.

11.6 Other Notable Frameworks

DeepSpeed-FastGen (Holmes et al., 2024): Microsoft's framework with Dynamic SplitFuse -- decomposes long prompts into chunks and composes short prompts to fill a target token budget. Integrated into Azure ML. Claims 2.3x throughput over vLLM in some scenarios.

TGI (Text Generation Inference): HuggingFace's serving solution. Good default for teams already in the HuggingFace ecosystem. Easy deployment, Flash-Decoding integration, reasonable performance without extensive tuning.

MLC LLM: Cross-platform compilation (CUDA, Metal, Vulkan, WebGPU). Compiles models for diverse hardware including phones and browsers. The choice when you need to run on non-NVIDIA hardware.

11.7 Choosing a Framework: Decision Guide

Serving Frameworks Comparison

| Framework | Core technique | Hardware | Best For |
|---|---|---|---|
| vLLM | PagedAttention | NVIDIA, AMD | High-concurrency production |
| SGLang | RadixAttention | NVIDIA, AMD | Agentic / structured output |
| TensorRT-LLM | Fused CUDA kernels | NVIDIA only | Max single-request performance |
| llama.cpp | CPU-first quantization | CPU, NVIDIA, Apple, Vulkan | Local / edge deployment |
| TGI | HuggingFace integration | NVIDIA, AMD, Intel, TPU | Easy production deployment |

Figure 10: Serving framework comparison by core technique, hardware support, and best-fit workload.

| If you need... | Use... | Why |
|---|---|---|
| High-concurrency API server | vLLM | Best throughput at scale, OpenAI-compatible |
| Agent/RAG workloads | SGLang | RadixAttention, prefix caching, structured output |
| Maximum latency optimization | TensorRT-LLM | Fused kernels, deepest hardware optimization |
| Run on laptop/phone | llama.cpp | CPU-first, no GPU required |
| Cross-platform (Metal, WebGPU) | MLC LLM | Compiles to any backend |
| Azure ML integration | DeepSpeed-FastGen | Native Microsoft stack |
| Quick start, HuggingFace models | TGI | Simplest deployment |

As far as I know, there is no single framework that dominates across all workloads. The field is moving too fast for any one system to maintain a permanent advantage. In late 2024 and early 2025, SGLang pulled ahead of vLLM on several benchmarks. vLLM responded with significant performance improvements. TensorRT-LLM remains fastest for latency-sensitive single-request scenarios. The leaderboard shifts every few months.

My practical advice: start with vLLM or SGLang (both are easy to set up and have OpenAI-compatible APIs). Here is a minimal setup for each:

# vLLM: 3 lines to a production-grade API server
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-prefix-caching
# Server is now running at http://localhost:8000
# Compatible with OpenAI Python client

# SGLang: similarly simple
pip install "sglang[all]"
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 8000

Benchmark on your actual workload, and switch if you find a specific framework that is clearly better for your use case. Most production teams I have talked to evaluate 2-3 frameworks before committing.


12. Emerging Techniques

12.1 Cascade Inference

Route easy requests to small models and hard requests to large models. If a 7B model can handle a simple factual question, why spend compute on a 70B model?

The challenge is determining difficulty before running the large model. Several approaches exist:

  • Confidence-based routing: Run the small model first. If its output confidence (measured by log-probability or entropy) exceeds a threshold, return the small model's response. Otherwise, fall back to the large model.
  • Classifier-based routing: Train a lightweight classifier to predict whether a query needs a large model based on the input alone.
  • Speculative routing: Speculative decoding provides an implicit signal -- if the draft model's acceptance rate is very high, the request is "easy" and the draft model might have been sufficient on its own.

The economics are compelling. If 70% of requests can be handled by a 7B model (10x cheaper to serve), the average cost per request drops by ~60% with only a small quality reduction on the remaining 30% of hard queries. This is essentially what API providers do when they offer different model tiers.
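
Confidence-based routing, sketched. The model-call signature returning per-token log-probabilities is an assumption for illustration, and the threshold is something you would tune on a validation set:

```python
def cascade(prompt, small_model, large_model, threshold=-0.5):
    """Serve with the small model when its mean token log-prob clears the
    threshold; otherwise fall back to (and pay for) the large model."""
    text, logprobs = small_model(prompt)        # assumed: (text, [logprob per token])
    confidence = sum(logprobs) / len(logprobs)  # mean log-prob as a crude confidence proxy
    if confidence >= threshold:
        return text, "small"
    text, _ = large_model(prompt)
    return text, "large"

# Stubs: an "easy" prompt the small model is sure about, a "hard" one it is not.
small = lambda p: ("Paris", [-0.1, -0.2]) if p == "easy" else ("?", [-2.0, -3.0])
large = lambda p: ("A careful answer", [-0.3])
cascade("easy", small, large)   # ("Paris", "small")
cascade("hard", small, large)   # ("A careful answer", "large")
```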

12.2 KV Cache Compression

Beyond quantization, you can selectively evict less important KV cache entries to free memory.

H2O (Heavy-Hitter Oracle) (Zhang et al., NeurIPS 2023): Observes that a small subset of tokens ("Heavy Hitters") receive disproportionate attention across layers and heads. These are typically content-bearing tokens (nouns, verbs, key entities) rather than function words.

H2O's eviction policy keeps two categories of tokens:

  1. Heavy Hitters: The top-K most attended tokens, identified by cumulative attention scores
  2. Recent tokens: A sliding window of the most recent M tokens (needed for local coherence)

Everything else is evicted from the KV cache. With only 20% of tokens retained, generation quality degrades minimally on most tasks (within 1-2% on benchmarks), while throughput improves up to 29x because the attention computation and KV cache memory shrink proportionally.
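
The two-part keep rule can be sketched in pure Python. This is a simplification: real H2O tracks scores per head and per layer and evicts online during decode:

```python
def h2o_keep_mask(attn, k_heavy=2, m_recent=2):
    """attn: rows = queries, cols = keys, accumulated attention weights.
    Keep the k most-attended tokens ("heavy hitters") plus the m most
    recent tokens; everything else is eligible for eviction."""
    num_keys = len(attn[0])
    cumulative = [sum(row[j] for row in attn) for j in range(num_keys)]
    heavy = set(sorted(range(num_keys), key=cumulative.__getitem__)[-k_heavy:])
    recent = set(range(max(0, num_keys - m_recent), num_keys))
    return [j in heavy or j in recent for j in range(num_keys)]

attn = [[0.6, 0.1, 0.1, 0.10, 0.10],
        [0.5, 0.3, 0.1, 0.05, 0.05]]
h2o_keep_mask(attn)   # [True, True, False, True, True]: 0 and 1 are heavy, 3 and 4 recent
```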

Other approaches to KV cache compression include:

  • Token merging: Combine similar tokens' KV entries into a single representative entry
  • Low-rank approximation: Project the KV cache into a lower-dimensional space
  • Learned eviction: Train a small policy network to decide which tokens to evict
  • SnapKV: Combines observation window analysis with vote-based eviction

This is one of the most active research areas in LLM serving, with new papers appearing monthly.

12.3 Attention Sinks for Infinite Context (StreamingLLM)

StreamingLLM (Xiao et al., ICLR 2024) discovered a surprising phenomenon: LLMs attend strongly to the first few tokens regardless of their semantic content. These "sink" tokens absorb excess attention mass.

Why does this happen? Softmax requires attention weights to sum to 1. When a token has no strong preference for any particular earlier token, the attention mass has to go somewhere. It gets dumped onto the earliest tokens as a mathematical artifact -- they serve as an "attention sink." If you evict these initial tokens from the sliding window (as naive window attention would), the attention distribution breaks and generation quality collapses.

StreamingLLM's fix is simple: always keep the first 4 tokens (the "sink" tokens) in the KV cache, plus a sliding window of the most recent N tokens. The cache has constant size: 4 + N tokens, regardless of how many tokens have been generated.

Standard KV cache (grows forever):
  [tok_1, tok_2, ..., tok_1000, tok_1001, ..., tok_10000, ...]
  Memory grows: eventually OOM

Sliding window (fails -- drops sinks):
  [..., tok_9997, tok_9998, tok_9999, tok_10000]
  Quality collapses when tok_1-tok_4 are evicted

StreamingLLM (constant memory):
  [tok_1, tok_2, tok_3, tok_4, ..., tok_9997, tok_9998, tok_9999, tok_10000]
  ↑ sink tokens (always kept)     ↑ sliding window (recent N tokens)
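
The diagram corresponds to a few lines of cache logic. This is a sketch of the eviction policy only; real implementations also adjust positional encodings as the window shifts:

```python
from collections import deque

class StreamingKVCache:
    """Pin the first `num_sinks` entries forever; keep a sliding window of the
    most recent `window` entries. Total size is constant: num_sinks + window."""
    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                      # attention-sink tokens, never evicted
        self.recent = deque(maxlen=window)   # deque auto-evicts the oldest entry

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def entries(self):
        return self.sinks + list(self.recent)

cache = StreamingKVCache()
for tok in range(10_000):
    cache.append(tok)
cache.entries()   # [0, 1, 2, 3, 9992, ..., 9999]: 12 entries, however long the stream
```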

This enables handling 4M+ token streams with 22.2x speedup over sliding-window recomputation. The technique has been integrated into TensorRT-LLM and HuggingFace Transformers.

The limitation is important: this enables infinite generation, not infinite understanding. The model can only attend to the sink tokens and the recent window -- it has no access to tokens that have been evicted. For applications requiring true long-context understanding (like analyzing a 100K-token document), you still need the full KV cache.

12.4 Disaggregated Serving at Scale

The Mooncake approach is evolving into multi-tier KV cache hierarchies:

  • Tier 1: GPU HBM (fastest, smallest)
  • Tier 2: CPU DRAM (10-50x larger, 10x slower)
  • Tier 3: SSD / NVMe (100x larger, 100x slower)
  • Tier 4: Remote storage via RDMA

KV cache becomes a distributed cache service, with migration and replication policies similar to content delivery networks. This is where LLM serving infrastructure starts to look more like distributed systems research than ML research.

12.5 Hardware-Aware Serving

FP4 quantization: NVIDIA's Blackwell (B100/B200) GPUs support FP4 Tensor Cores, which could deliver up to 2x throughput over FP8 if quality retention proves sufficient. Early results show 96-98% quality retention.

CXL (Compute Express Link): Enables expanded memory pools by connecting CPU and GPU memory through a coherent interconnect. A CXL-attached DRAM pool could provide 10x more memory for KV caches at moderate bandwidth.

Specialized inference accelerators: Groq's LPU (Language Processing Unit) has demonstrated 500+ tokens/second per user in public demos (as of early 2025) by using SRAM (Static RAM -- fast, small on-chip memory)-only architecture that eliminates the HBM bottleneck entirely. Since decode is memory-bandwidth-bound and SRAM is 10x faster than HBM, removing HBM from the critical path fundamentally changes the performance equation. Cerebras WSE-3 takes a similar approach, fitting entire models in on-chip memory across its wafer-scale chip.

These accelerators have limitations -- cost per chip, limited flexibility, and smaller software ecosystems -- but they demonstrate that the memory-bound decode bottleneck is a hardware design problem, not just a software optimization problem.

The trend: hardware and software co-design, where serving systems, model architectures, and accelerator hardware are designed together. GQA is a good early example: a model architecture change (fewer KV heads) that was motivated by serving efficiency. Future models will likely incorporate more serving-aware design decisions during training.


13. Conclusion

In short, the key takeaways from this article:

  • LLM serving is a memory management problem. The KV cache -- not the model weights -- is the bottleneck once you are serving concurrent requests. Every technique in this article ultimately manages KV cache memory.

  • The prefill-decode duality shapes everything. Prefill is compute-bound; decode is memory-bound. These phases want different batch sizes, different parallelism strategies, and even different hardware. Recognizing this duality is the key to understanding serving architecture.

  • Continuous batching + PagedAttention are table stakes. Any production serving system in 2026 should use both. Together they deliver 10-50x throughput over naive static batching -- the single biggest win comes from simply switching from HuggingFace Transformers to vLLM or SGLang.

  • Speculative decoding offers "free" speedup. The output distribution is mathematically identical to standard decoding. Use self-speculative methods (Medusa, EAGLE) if you want the speedup without maintaining a second model.

  • Prefill-decode disaggregation is becoming the default for large-scale deployment. The engineering overhead is real, but 2-7x more requests within SLO is hard to ignore.

  • Prefix caching should always be enabled. It is free throughput for any workload with shared structure -- and most real workloads do.

  • FP8 on H100 is the current sweet spot for quantization: 30-50% throughput gain at 99%+ quality. INT4 weights + FP8 KV cache for maximum memory efficiency.

  • No single serving framework dominates. vLLM for high concurrency, SGLang for agentic workloads, TensorRT-LLM for maximum hardware utilization, llama.cpp for local deployment.

The speed of innovation in LLM serving has been remarkable. In just three years (2022-2025), the field went from naive static batching to sophisticated systems that borrow ideas from OS kernels, compilers, and distributed databases. For a framework to organize all of these techniques, see the next section.


14. Bonus: The Serving Stack -- A Mental Model

After 13 sections of techniques, it helps to have a framework for organizing everything. Here is how I think about the LLM serving stack, from bottom to top:

Layer 1 -- Hardware: GPU memory hierarchy (HBM, SRAM, registers), interconnect topology (NVLink, InfiniBand, PCIe), specialized accelerators. Every optimization ultimately maps to a hardware capability.

Layer 2 -- Memory Management: PagedAttention, KV cache quantization, KV cache compression (H2O), prefix caching (RadixAttention). Controls how the scarce KV cache memory is allocated, shared, and reclaimed.

Layer 3 -- Execution Engine: Continuous batching, chunked prefill, FlashAttention kernels, fused CUDA kernels. Controls how computation is organized within a single forward pass.

Layer 4 -- Scheduling: Request admission control, preemption (swap vs. recompute), SLO-aware scheduling, cache-aware routing. Controls which requests run when.

Layer 5 -- Optimization: Quantization (weight, activation, KV cache), speculative decoding. Orthogonal techniques that reduce resource requirements.

Layer 6 -- Architecture: Disaggregated prefill/decode, parallelism strategy (TP, PP, EP), multi-tier KV cache hierarchy. Controls the overall system topology.

Layer 7 -- API: OpenAI-compatible endpoint, streaming responses, structured output constraints, multi-LoRA routing. The interface users and applications see.

When you encounter a new technique or paper, ask: which layer does it operate at? What resource constraint does it address? This mental model turns a collection of disparate tricks into a coherent system design.

For example:

  • Mooncake operates at layers 2, 4, and 6 simultaneously -- that is what makes it powerful. It reimagined memory management (KV pool), scheduling (Conductor), and architecture (disaggregation) as a unified system.
  • EAGLE operates at layer 5 (optimization), but its benefits cascade upward: faster decode means shorter request lifetimes, which means the scheduler (layer 4) has more room to admit requests and the execution engine (layer 3) can batch more efficiently.
  • FP8 operates at layers 1 and 5 (hardware capability + software optimization). The technique only works because H100 hardware provides FP8 Tensor Cores -- a good example of hardware-software co-design.
  • Chunked prefill operates at layer 3 (execution engine) but was motivated by scheduling problems at layer 4. The prefill bubble was a scheduling issue that got solved by changing the execution strategy.

The most impactful innovations tend to span multiple layers. A single-layer optimization (like a faster kernel) gives incremental gains. A system that co-optimizes across layers (like Mooncake combining memory management, scheduling, and architecture) gives multiplicative gains.

I strongly suspect the biggest advances in the next 2-3 years will come from this cross-layer co-optimization -- techniques that treat the serving stack as a single integrated system rather than a collection of independent components. The serving stack is converging with the OS kernel design community for good reason: both manage scarce shared resources under competing demands with strict performance constraints.

LLM serving has become its own subfield of systems research. The mental model above should help you navigate whatever comes next.

Remember those 20 concurrent users that crashed your 70B model? Now you know exactly why: each 4K-token request reserves gigabytes of KV cache, and under a naive allocator -- contiguous max-length reservations plus fragmentation -- 20 of those exhaust the memory left over after the 140 GB of weights. The fix: PagedAttention eliminates fragmentation waste, continuous batching maximizes throughput, and speculative decoding generates faster without changing the output. The same $200K GPU cluster now handles 500+ concurrent users -- not because you threw more hardware at it, but because you understood where the memory actually goes.


Further Reading

Foundational Papers:

  • Vaswani et al. (2017). "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
  • Pope et al. (2023). "Efficiently Scaling Transformer Inference." MLSys 2023. arXiv:2211.05102
  • Shoeybi et al. (2020). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv:1909.08053
  • Huang et al. (2019). "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS 2019. arXiv:1811.06965

Core Serving Systems:

  • Yu et al. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022. USENIX
  • Kwon et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180
  • Zheng et al. (2024). "SGLang: Efficient Execution of Structured Language Model Programs." NeurIPS 2024. arXiv:2312.07104
  • Agrawal et al. (2024). "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve." OSDI 2024. arXiv:2403.02310

Speculative Decoding:

  • Leviathan et al. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. arXiv:2211.17192
  • Chen et al. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv:2302.01318
  • Li et al. (2024). "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty." ICML 2024. arXiv:2401.15077
  • Xia et al. (2024). "Unlocking Efficiency in LLM Inference: A Comprehensive Survey of Speculative Decoding." ACL Findings 2024. arXiv:2401.07851

Disaggregation:

  • Zhong et al. (2024). "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving." OSDI 2024. arXiv:2401.09670
  • Patel et al. (2024). "Splitwise: Efficient Generative LLM Inference Using Phase Splitting." ISCA 2024. arXiv:2311.18677
  • Qin et al. (2024). "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving." FAST 2025 Best Paper. arXiv:2407.00079

Quantization:

  • Frantar et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023. arXiv:2210.17323
  • Lin et al. (2024). "AWQ: Activation-aware Weight Quantization." MLSys 2024 Best Paper. arXiv:2306.00978
  • Xiao et al. (2023). "SmoothQuant: Accurate and Efficient Post-Training Quantization for LLMs." ICML 2023. arXiv:2211.10438

KV Cache Management:

  • Shazeer (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150
  • Ainslie et al. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. arXiv:2305.13245
  • Xiao et al. (2024). "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. arXiv:2309.17453
  • Zhang et al. (2023). "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of LLMs." NeurIPS 2023. arXiv:2306.14048


Code:

  • vLLM -- PagedAttention, continuous batching, multi-LoRA
  • SGLang -- RadixAttention, structured generation
  • llama.cpp -- CPU-first inference, GGUF quantization
  • TensorRT-LLM -- NVIDIA-optimized serving