Speculative Decoding, Formally: The Algorithm, the Proof, and the Metrics That Matter
*Part 2 of 6 — Speculative Decoding series. Start with The Need for Speed.*
1. The Need for Speed — Why LLMs are memory-bandwidth-bound and how speculation exploits the GPU's idle compute. 2. Speculative Decoding, Formally (this post) — The draft-then-verify algorithm, the rejection-sampling proof, and the metrics that matter. 3. A Field Guide to Speculative Decoding Methods — Separate models, Medusa, n-gram, and trees: a taxonomy of every drafting approach. 4. The EAGLE Family — How predicting hidden states instead of tokens reshaped the field, across three generations. 5. Parallel Drafting with Block Diffusion — DFlash collapses γ draft passes into one; DDTree turns that pass into a 7× verification tree. 6. Putting It to Work — Enabling EAGLE, n-gram, and Medusa speculation in vLLM and SGLang, with a guide to measuring real speedup.
---
*Speculative decoding makes an LLM generate faster without changing its output. It can work because LLM decoding is memory-bandwidth-bound — at batch size 1 the GPU runs at well under 1% of its compute, spending almost all its time reading the full model out of memory just to emit one token. That idle compute is the room speculation fills: verifying ten proposed tokens reads the weights once, whereas generating ten reads them ten times.*
This post makes that intuition exact. We give the draft-then-verify algorithm precisely, *prove* that it leaves the model's output distribution untouched, and then define the metrics you need to tell a genuine speedup from a misleading one.
The draft-then-verify paradigm
Speculative decoding splits generation into two stages:
1. Draft. A fast, cheap model $Mq$ (the *draft model*) generates $\gamma$ candidate tokens autoregressively: $\tilde{y}1, \tilde{y}2, \ldots, \tilde{y}\gamma$. 2. Verify. The expensive target model $M_p$ processes all $\gamma$ candidates in a *single* forward pass and decides which to accept.
The whole game rests on one asymmetry from The Need for Speed: verification is parallel while generation is sequential. The target can compute $p(\cdot \mid \tilde{y}_{ If $k$ of the $\gamma$ tokens are accepted, we have produced $k+1$ tokens (the $k$ accepted, plus one fresh token — resampled at the rejection point if $k<\gamma$, or the bonus token if all $\gamma$ are accepted) using a single forward pass of the target plus $\gamma$ cheap draft passes. When $k$ is large relative to the drafting overhead, that is a real speedup.The rejection-sampling algorithm
The mathematical core is a modified rejection-sampling scheme that guarantees *exact* distribution matching.
Input: target $Mp$, draft $Mq$, draft length $\gamma$, prefix $x$.
Draft step. For $t = 1, \ldots, \gamma$: sample $\tilde{y}t \sim q(\cdot \mid x, \tilde{y}{1:t-1})$.
Verify step. Run $Mp$ on $(x, \tilde{y}{1:\gamma})$ to obtain $p(\cdot \mid x, \tilde{y}_{1:t-1})$ for all $t = 1, \ldots, \gamma+1$.
Accept / reject. For $t = 1, \ldots, \gamma$:
Adjusted distribution. On rejection at position $t$:
$yt \sim \text{norm}\left(\max\left(0,\; p(\cdot \mid x, y{1:t-1}) - q(\cdot \mid x, y_{1:t-1})\right)\right),$
where $\text{norm}(\cdot)$ rescales to sum to 1.
Bonus token. If all $\gamma$ tokens are accepted, sample one extra token $y{\gamma+1} \sim p(\cdot \mid x, y{1:\gamma})$ for free — we already computed that distribution.
Proof of losslessness
This is the property that makes speculative decoding a free lunch rather than another lossy approximation, so it is worth proving. We show the output at any position follows exactly $p$. Fix a position and let $p = p(\cdot \mid \text{prefix})$, $q = q(\cdot \mid \text{prefix})$. The probability the output is token $v$ is
$P(\text{output} = v) = P(\text{accept } v) + P(\text{reject, then resample } v).$
Case 1: $q(v) \le p(v)$. The token is proposed with probability $q(v)$ and accepted with probability $\min(1, p(v)/q(v)) = 1$:
$P(\text{accept } v) = q(v).$
It can also arrive via rejection-and-resample. The residual mass on $v$ is $p(v) - q(v)$, and the residual distribution is normalized by $1-\alpha$ where $\alpha = \sum_{v'} \min(p(v'), q(v'))$ is the total acceptance probability. The resample route contributes
$P(\text{reject, resample } v) = (1-\alpha)\cdot \frac{p(v)-q(v)}{1-\alpha} = p(v) - q(v).$
The normalization is valid because, since $\sum p = \sum q = 1$, total excess of $p$ over $q$ equals total excess of $q$ over $p$:
$\sum{v': p > q}(p(v') - q(v')) = \sum{v': q > p}(q(v') - p(v')) = 1-\alpha.$
Summing: $P(\text{output} = v) = q(v) + p(v) - q(v) = p(v).$ ✓
Case 2: $q(v) > p(v)$. Now acceptance probability is $p(v)/q(v) < 1$, and the residual gives $v$ zero mass since $\max(0, p(v)-q(v)) = 0$:
$P(\text{accept } v) = q(v)\cdot \frac{p(v)}{q(v)} = p(v), \qquad P(\text{reject, resample } v) = 0.$
So $P(\text{output} = v) = p(v).$ ✓
In both cases the output is distributed exactly as $p$. $\blacksquare$
This argument is per-position, but it lifts to the whole sequence immediately: each accepted token extends the prefix, and the next verification conditions on that exact prefix, so applying the result position by position shows every emitted token — and therefore the entire generated sequence — is distributed exactly as the target $p$.
The draft model's quality never appears in the conclusion. A *good* draft is accepted often (big speedup); a *bad* draft is rejected often (small speedup); neither changes the distribution. Contrast this with the lossy alternatives:
Speculative decoding is the rare systems-level optimization with an exact correctness guarantee.
How much do we gain? Expected tokens per step
Let $\alpha_t$ be the acceptance rate at position $t$:
$\alphat = \sum{v \in \mathcal{V}} \min\big(pt(v), qt(v)\big) = 1 - \text{TV}(pt, qt),$
where $\text{TV}(pt, qt) = \tfrac{1}{2}\lVert pt - qt\rVert_1$ is the total-variation distance. This identity is worth pausing on: acceptance rate is one minus the TV distance between draft and target. A draft that matches the target ($q=p$) gives $\alpha = 1$; a draft with disjoint support gives $\alpha = 0$.
Modeling each position as accepted independently with a constant rate $\alpha$ (an idealization — real acceptances are correlated and $\alpha_t$ varies by position, but it captures the dynamics well), the expected number of *new tokens per step* (accepted tokens plus the bonus) has a clean closed form:
$\mathbb{E}[\text{tokens/step}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}.$
A few values make the dynamics concrete:
The acceptance rate $\alpha$ is the single most important number governing performance — but, as we will see, it is *necessary but not sufficient*.
---
The metrics that matter
It is easy to report a flattering number. A method can boast a high acceptance rate yet deliver no wall-clock benefit because its drafting is too expensive. Below are the metrics a complete evaluation needs, and how they relate.
1. Token acceptance rate ($\alpha$)
The empirical acceptance rate over a run, pooled over every token that actually received an accept/reject test:
$\bar\alpha = \frac{\sum{i=1}^{N} (\text{accepted tokens in step } i)}{\sum{i=1}^{N} (\text{verified tokens in step } i)}.$
A "verified" token is one that was subjected to the accept/reject decision — the accepted tokens plus the single rejected token at the stopping point. The drafted tokens *discarded* after a rejection are not counted: they never reached verification. (Dividing by the full draft block $\gamma$ instead would conflate acceptance rate with draft-budget utilization and systematically *underestimate* $\alpha$ — e.g. reporting $\approx 0.54$ when the true per-token rate is $0.8$.)
It is the strongest *single* predictor of performance, but it is distribution-, position-, and temperature-dependent: easy tokens (common words, deterministic continuations) accept more readily than hard ones (creative or technical content); early tokens in a response accept less; lower temperature concentrates mass and usually raises $\alpha$.
2. Expected tokens per step ($\bar k$)
Including the bonus token,
$\bar k = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}.$
This is the amortized number of new tokens per target forward pass. The table below shows why blindly increasing $\gamma$ is a trap:
| $\alpha$ | $\gamma=3$ | $\gamma=5$ | $\gamma=7$ | $\gamma=10$ | |----------|-----------|-----------|-----------|------------| | 0.5 | 1.88 | 1.97 | 1.99 | 2.00 | | 0.7 | 2.53 | 2.94 | 3.14 | 3.27 | | 0.8 | 2.95 | 3.69 | 4.16 | 4.57 | | 0.9 | 3.44 | 4.69 | 5.70 | 6.86 | | 0.95 | 3.71 | 5.30 | 6.73 | 8.62 |
For low $\alpha$, increasing $\gamma$ past 3–5 buys almost nothing (every extra draft token needs *all* prior ones accepted, probability $\alpha^k$); for high $\alpha$, longer drafts pay off handsomely.
3. Wall-clock speedup ($S$) — the metric that actually pays the bills
Let $cp$ be the cost of one target forward pass, $cq$ the cost of one draft pass. For $N$ tokens, autoregressive decoding costs $N cp$. Speculative decoding runs $N/\bar k$ steps, each costing $c{\text{draft}} + c_{\text{verify}}$:
$S = \frac{T{\text{AR}}}{T{\text{spec}}} = \frac{\bar k\, cp}{c{\text{draft}} + c{\text{verify}}} \approx \frac{\bar k\, cp}{\gamma\, cq + cp}.$
Defining the draft overhead ratio $\rho = cq / cp$:
$\boxed{\,S \approx \frac{\bar k}{1 + \gamma\rho}\,}$
This formula is the whole story in miniature. The numerator $\bar k$ is what acceptance buys you; the denominator $1 + \gamma\rho$ is what drafting costs you. A draft model with $\alpha = 0.9$ but $\rho = 0.5$ (too big a draft) can easily *lose* to one with $\alpha = 0.75$ and $\rho = 0.02$. A typical well-chosen draft has $\rho < 0.05$ — under 5% of the target's cost.
One subtlety: $c{\text{verify}} \ne cp$ exactly. Verifying $\gamma$ tokens costs slightly more than generating one because attention runs over a longer sequence and the FFN processes $\gamma$ tokens — but since the FFN is memory-bound at batch size 1, $c{\text{verify}} \approx cp + \epsilon$ for moderate $\gamma$.
4. Optimal draft length ($\gamma^\star$)
Substituting $\bar k$ into $S$:
$S(\gamma) = \frac{1 - \alpha^{\gamma+1}}{(1-\alpha)(1 + \gamma\rho)}.$
Setting $\partial S/\partial\gamma = 0$ gives an implicit equation with no closed form,
$-\alpha^{\gamma^\star+1}\ln\alpha\,(1 + \gamma^\star\rho) = \rho\,(1 - \alpha^{\gamma^\star+1}),$
but the intuition is clean: high $\alpha$, low $\rho$ → longer drafts; low $\alpha$, high $\rho$ → shorter drafts.
5. Throughput, latency, and burstiness
What to report
| Metric | Symbol | What it tells you | |--------|--------|-------------------| | Acceptance rate | $\alpha$ | Draft quality | | Tokens per step | $\bar k$ | Amortized generation efficiency | | Wall-clock speedup | $S$ | End-to-end practical benefit | | Optimal draft length | $\gamma^\star$ | Configuration guidance | | Memory overhead | $\Delta M$ | Deployment feasibility | | Throughput / latency | — | System-level performance and UX |
The single most common evaluation mistake is reporting only $\alpha$ or $\bar k$ and never measuring the clock. A high acceptance rate is necessary but not sufficient. Always close the loop with $S$.
---
We now have the algorithm, the proof, and the yardsticks. What we have described so far is *vanilla* speculative decoding — one small model drafting for one big one. But that is only the first move in a large and inventive design space: Who drafts? How? In a chain or a tree? In A Field Guide to Methods we survey the landmark advances — Medusa, Lookahead, staged cascades, SpecInfer, Sequoia, self-speculation — and the trade-offs that distinguish them.
---
References
1. Leviathan, Y., Kalman, M., & Matias, Y. (2023). *Fast Inference from Transformers via Speculative Decoding.* ICML. arXiv:2211.17192. 2. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). *Accelerating Large Language Model Decoding with Speculative Sampling.* arXiv:2302.01318. 3. Xia, H., et al. (2024). *Unlocking Efficiency in LLM Inference: A Comprehensive Survey of Speculative Decoding.* ACL Findings. arXiv:2401.07851. 4. Leviathan, Y., et al. — see the original paper's appendix for the formal correctness proof of speculative sampling.