The Need for Speed: Why LLMs Are Slow and What Speculation Promises

June 25, 2026

*Part 1 of 6 — Speculative Decoding series.*

1. The Need for Speed (this post) — Why LLMs are memory-bandwidth-bound and how speculation exploits the GPU's idle compute. 2. Speculative Decoding, Formally — The draft-then-verify algorithm, the rejection-sampling proof, and the metrics that matter. 3. A Field Guide to Speculative Decoding Methods — Separate models, Medusa, n-gram, and trees: a taxonomy of every drafting approach. 4. The EAGLE Family — How predicting hidden states instead of tokens reshaped the field, across three generations. 5. Parallel Drafting with Block Diffusion — DFlash collapses γ draft passes into one; DDTree turns that pass into a 7× verification tree. 6. Putting It to Work — Enabling EAGLE, n-gram, and Medusa speculation in vLLM and SGLang, with a guide to measuring real speedup.

---

We want large language models to run *fast*.

Not "fast" in the abstract — fast in the way that changes what you can build. A chat assistant that streams at 200 tokens/second feels alive; one that dribbles out 20 tokens/second feels broken. An agent that has to plan, call tools, and re-plan across dozens of LLM calls multiplies every millisecond of per-token latency into seconds of wall-clock waiting. A reasoning model that "thinks" for two thousand tokens before answering pays for every one of them sequentially. Latency is not a vanity metric. It is the budget that decides whether an idea is a product or a demo.

So the question that motivates this entire series is simple: how do we make an LLM generate tokens faster — without changing a single bit of its output?

That last clause is the hard part. We already know how to make models faster if we are willing to make them *worse*: quantize the weights, prune the heads, distill into a smaller student. Every one of those trades quality for speed. What we want is a free lunch — more speed, identical output. That sounds impossible. This series is about the technique that makes it not just possible but routine.

The promise of speculation

The idea comes from an old trick in CPU design called speculative execution. A modern processor, when it hits a branch (if (x) ... else ...), does not stall and wait to find out which way x goes. It *guesses* the likely direction, races ahead executing instructions on that assumption, and keeps a way to roll back if the guess was wrong. When the guess is right — which it usually is — the work is already done. When it is wrong, you throw it away and you are no worse off than if you had waited.

Speculative decoding applies exactly this principle to text generation, with a twist that goes beyond what CPUs need: a *mathematical guarantee* that speculation preserves the model's exact output distribution — not just a deterministic answer, but the full probabilities, even under random sampling.

The shape of the trick is this. Generating a token with a giant model is expensive. But *checking* whether a proposed token is one the giant model would have produced is cheap — you can check a whole batch of proposed tokens in a single pass. So:

1. Draft. A small, cheap model races ahead and guesses the next several tokens. 2. Verify. The big, expensive model checks all of those guesses *at once*, in a single forward pass, and accepts every guess that matches what it would have done anyway.

When the small model guesses well, the big model confirms five tokens for the price of roughly one step and you have just generated five tokens in the time it normally takes to generate one. When the small model guesses badly, the big model rejects the bad guesses and falls back to generating normally — you lose a little time on the wasted draft, but the output is drawn from exactly the distribution the big model defines — byte-for-byte identical under greedy decoding, and a statistically exact sample under temperature sampling. Bad guesses cost speed, never quality.

In practice this buys a 2–4× speedup with zero change in output distribution. That is the promise. The rest of this series is about how it works, how to measure it, and how the field has pushed that 2× toward 5× and beyond.

But to understand *why* this works — why "verifying many tokens" is so much cheaper than "generating many tokens" — we have to look under the hood at what an LLM actually does when it decodes. The entire speedup hides in one counterintuitive fact about modern hardware. So let's go find it.

---

Where the time actually goes

The autoregressive factorization

Every autoregressive language model defines a joint distribution over token sequences by factorizing it via the chain rule of probability:

$P(y1, y2, \ldots, yT \mid x) = \prod{t=1}^{T} P(yt \mid x, y1, \ldots, y_{t-1})$

where $x = (x1, \ldots, xn)$ is the input prompt and $y = (y1, \ldots, yT)$ is the generated continuation. At each step $t$, the model computes a distribution over the entire vocabulary $\mathcal{V}$ conditioned on everything that came before, and a token is selected — either greedily ($yt = \arg\maxv P(v \mid x, y_{

This factorization is *exact*; it introduces no approximation. But it imposes a strict sequential dependency: you cannot compute $P(yt \mid \cdot)$ without knowing $y{t-1}$, which requires $y_{t-2}$, and so on. This is the fundamental bottleneck. The chain rule says generation is a chain, and you cannot pull a chain faster by pulling on the far end.

Two phases: prefill and decode

Transformer inference has two computational phases with very different performance characteristics.

Prefill. All $n$ prompt tokens are processed simultaneously. The self-attention computation,

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{dk}}\right) V, \qquad Q, K, V \in \mathbb{R}^{n \times dk},$

is a matrix–*matrix* multiplication. For a model with $P$ parameters, prefill costs roughly $\text{FLOPs}_{\text{prefill}} \approx 2 n P$, organized as large GEMM operations that GPUs execute very efficiently.

Decode. Now we generate one token at a time. At step $t$ we have a single new token embedding $x_t \in \mathbb{R}^{1 \times d}$, and the attention becomes

$\text{Attention}(qt, K{1:t}, V{1:t}) = \text{softmax}\left(\frac{qt K{1:t}^\top}{\sqrt{dk}}\right) V_{1:t},$

a matrix–*vector* product. The FLOPs per decode step are $\text{FLOPs}_{\text{decode}} \approx 2P$ — the *same* $2P$ regardless of sequence length (attention is small relative to the feedforward layers at moderate context lengths).

The memory-bandwidth wall

Here is the crux. Each decode step must read essentially *all* the model's parameters out of GPU high-bandwidth memory (HBM) into the compute units. For a model with $P$ parameters in fp16, that is $2P$ bytes read to do $2P$ FLOPs of work. The arithmetic intensity — compute per byte moved — is

$\frac{\text{FLOPs}}{\text{Bytes}} = \frac{2P}{2P} = 1 \text{ FLOP/byte}.$

Now compare to what the hardware can do. An NVIDIA A100:

| Metric | Value | |--------|-------| | FP16 compute | 312 TFLOPS | | HBM bandwidth | 2.0 TB/s | | Compute-to-bandwidth ratio | 156 FLOP/byte |

The GPU *wants* 156 FLOPs of work for every byte you feed it. Decoding gives it 1. So during decoding, the GPU runs at under 1% of its compute capability — by this roofline estimate, nearly all of its time is spent waiting for weights to arrive from memory rather than doing arithmetic. Decode is not compute-bound; it is *memory-bandwidth-bound*.

The time for one decode step is therefore set by memory traffic, not arithmetic:

$T{\text{decode}} \approx \frac{2P}{\text{BW}{\text{mem}}}.$

For a 7B model on an A100:

$T_{\text{decode}} \approx \frac{2 \times 7\times10^{9}}{2\times10^{12}} = 7 \text{ ms} \;\Rightarrow\; \approx \frac{1000}{7} \approx 142 \text{ tokens/s},$

which is the theoretical ceiling; real single-stream decode lands somewhat lower (overheads, imperfect bandwidth utilization), but in the right ballpark.

This is the insight that makes speculation work. The GPU is mostly idle during decoding. Verifying ten proposed tokens reads the weights *once* — the same $2P$ bytes — and uses the formerly-idle compute capacity to check them all. (There is additional attention and KV work per token, but the dominant memory cost of reading the weights is shared across the batch.) Generating ten tokens the normal way reads the weights *ten times*. The wasted capacity is exactly the room speculation fills.

The KV cache

To avoid recomputing attention over the whole history at each step, we cache the per-layer key and value tensors — the *KV cache*:

$\text{KV Cache} = \{(Kl^{1:t}, Vl^{1:t})\}_{l=1}^{L}.$

Its footprint is

$\text{KV Memory} = 2 \times L \times d \times t \times \text{bytes/param},$

(2 for K and V, $L$ layers, hidden size $d$, $t$ cached tokens). For a 7B model (32 layers, $d = 4096$) at sequence length 2048 in fp16 this is $2 \times 32 \times 4096 \times 2048 \times 2 \approx$ 1 GB. The cache trades memory for computation — without it, each step would re-attend over the full history, making decoding quadratic in length.

Batching: a partial escape

The obvious way to raise arithmetic intensity is *batching* — process $B$ sequences at once, so the weights you read serve $B$ tokens of work:

$\text{Arithmetic Intensity}_{\text{batch}} \approx B.$

Once $B$ is large enough we cross from memory-bound to compute-bound, around the critical batch size $B^\star \approx 156$ on an A100. But batching has two hard limits:

1. Memory. Each sequence needs its own KV cache; total cache grows as $B \times \text{KV Memory}$ and quickly exhausts the GPU. 2. Latency ≠ throughput. Batching raises aggregate *throughput* (tokens/s across all users) but does not improve the *latency* of any single sequence — and under contention it can make it worse. For an interactive chat or a single agent, one user is waiting on one sequence, and batching does not help them.

Speculation, by contrast, attacks *latency directly* — it makes a single sequence finish sooner.

The fundamental challenge, in one breath

1. Sequential dependency — token $t$ needs token $t-1$; we cannot parallelize across time. 2. Low arithmetic intensity — each step reads the whole model to emit one token; the GPU is starved. 3. Memory scaling — the KV cache grows with length and batch, capping how much batching can rescue us.

The ideal fix would generate *multiple tokens per forward pass of the large model, without sacrificing output quality.* That is exactly what speculative decoding delivers, and it does so backed by a clean proof that the output distribution is preserved exactly.

We have the motivation (we want speed), the idea (draft then verify), and now the reason it can possibly pay off (decode wastes the GPU). In Speculative Decoding, Formally we make speculative decoding precise: the draft-then-verify algorithm, the rejection-sampling proof of losslessness, and the metrics you need to tell a real speedup from a paper one.

---

References

1. Leviathan, Y., Kalman, M., & Matias, Y. (2023). *Fast Inference from Transformers via Speculative Decoding.* ICML. arXiv:2211.17192. 2. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). *Accelerating Large Language Model Decoding with Speculative Sampling.* arXiv:2302.01318. 3. Vaswani, A., et al. (2017). *Attention Is All You Need.* NeurIPS. arXiv:1706.03762. 4. Kwon, W., et al. (2023). *Efficient Memory Management for Large Language Model Serving with PagedAttention.* ACM SOSP. (vLLM) 5. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). *FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.* NeurIPS. arXiv:2205.14135. 6. NVIDIA A100 Tensor Core GPU Architecture whitepaper (2020), for the compute/bandwidth figures.