Speculative Decoding, Formally: The Algorithm, the Proof, and the Metrics That Matter

June 25, 2026

*Part 2 of 6 — Speculative Decoding series. Start with The Need for Speed.*

1. The Need for Speed — Why LLMs are memory-bandwidth-bound and how speculation exploits the GPU's idle compute. 2. Speculative Decoding, Formally (this post) — The draft-then-verify algorithm, the rejection-sampling proof, and the metrics that matter. 3. A Field Guide to Speculative Decoding Methods — Separate models, Medusa, n-gram, and trees: a taxonomy of every drafting approach. 4. The EAGLE Family — How predicting hidden states instead of tokens reshaped the field, across three generations. 5. Parallel Drafting with Block Diffusion — DFlash collapses γ draft passes into one; DDTree turns that pass into a 7× verification tree. 6. Putting It to Work — Enabling EAGLE, n-gram, and Medusa speculation in vLLM and SGLang, with a guide to measuring real speedup.

---

*Speculative decoding makes an LLM generate faster without changing its output. It can work because LLM decoding is memory-bandwidth-bound — at batch size 1 the GPU runs at well under 1% of its compute, spending almost all its time reading the full model out of memory just to emit one token. That idle compute is the room speculation fills: verifying ten proposed tokens reads the weights once, whereas generating ten reads them ten times.*

This post makes that intuition exact. We give the draft-then-verify algorithm precisely, *prove* that it leaves the model's output distribution untouched, and then define the metrics you need to tell a genuine speedup from a misleading one.

The draft-then-verify paradigm

Speculative decoding splits generation into two stages:

1. Draft. A fast, cheap model $Mq$ (the *draft model*) generates $\gamma$ candidate tokens autoregressively: $\tilde{y}1, \tilde{y}2, \ldots, \tilde{y}\gamma$. 2. Verify. The expensive target model $M_p$ processes all $\gamma$ candidates in a *single* forward pass and decides which to accept.

The whole game rests on one asymmetry from The Need for Speed: verification is parallel while generation is sequential. The target can compute $p(\cdot \mid \tilde{y}_{

If $k$ of the $\gamma$ tokens are accepted, we have produced $k+1$ tokens (the $k$ accepted, plus one fresh token — resampled at the rejection point if $k<\gamma$, or the bonus token if all $\gamma$ are accepted) using a single forward pass of the target plus $\gamma$ cheap draft passes. When $k$ is large relative to the drafting overhead, that is a real speedup.

The rejection-sampling algorithm

The mathematical core is a modified rejection-sampling scheme that guarantees *exact* distribution matching.

Input: target $Mp$, draft $Mq$, draft length $\gamma$, prefix $x$.

Draft step. For $t = 1, \ldots, \gamma$: sample $\tilde{y}t \sim q(\cdot \mid x, \tilde{y}{1:t-1})$.

Verify step. Run $Mp$ on $(x, \tilde{y}{1:\gamma})$ to obtain $p(\cdot \mid x, \tilde{y}_{1:t-1})$ for all $t = 1, \ldots, \gamma+1$.

Accept / reject. For $t = 1, \ldots, \gamma$:

  • Draw $r \sim \text{Uniform}(0,1)$.
  • If $r < \dfrac{p(\tilde{y}t \mid x, \tilde{y}{1:t-1})}{q(\tilde{y}t \mid x, \tilde{y}{1:t-1})}$: accept $\tilde{y}_t$.
  • Otherwise: reject $\tilde{y}t$, sample a replacement from the adjusted distribution, discard $\tilde{y}{t+1:\gamma}$, and stop.
  • Adjusted distribution. On rejection at position $t$:

    $yt \sim \text{norm}\left(\max\left(0,\; p(\cdot \mid x, y{1:t-1}) - q(\cdot \mid x, y_{1:t-1})\right)\right),$

    where $\text{norm}(\cdot)$ rescales to sum to 1.

    Bonus token. If all $\gamma$ tokens are accepted, sample one extra token $y{\gamma+1} \sim p(\cdot \mid x, y{1:\gamma})$ for free — we already computed that distribution.

    Proof of losslessness

    This is the property that makes speculative decoding a free lunch rather than another lossy approximation, so it is worth proving. We show the output at any position follows exactly $p$. Fix a position and let $p = p(\cdot \mid \text{prefix})$, $q = q(\cdot \mid \text{prefix})$. The probability the output is token $v$ is

    $P(\text{output} = v) = P(\text{accept } v) + P(\text{reject, then resample } v).$

    Case 1: $q(v) \le p(v)$. The token is proposed with probability $q(v)$ and accepted with probability $\min(1, p(v)/q(v)) = 1$:

    $P(\text{accept } v) = q(v).$

    It can also arrive via rejection-and-resample. The residual mass on $v$ is $p(v) - q(v)$, and the residual distribution is normalized by $1-\alpha$ where $\alpha = \sum_{v'} \min(p(v'), q(v'))$ is the total acceptance probability. The resample route contributes

    $P(\text{reject, resample } v) = (1-\alpha)\cdot \frac{p(v)-q(v)}{1-\alpha} = p(v) - q(v).$

    The normalization is valid because, since $\sum p = \sum q = 1$, total excess of $p$ over $q$ equals total excess of $q$ over $p$:

    $\sum{v': p > q}(p(v') - q(v')) = \sum{v': q > p}(q(v') - p(v')) = 1-\alpha.$

    Summing: $P(\text{output} = v) = q(v) + p(v) - q(v) = p(v).$ ✓

    Case 2: $q(v) > p(v)$. Now acceptance probability is $p(v)/q(v) < 1$, and the residual gives $v$ zero mass since $\max(0, p(v)-q(v)) = 0$:

    $P(\text{accept } v) = q(v)\cdot \frac{p(v)}{q(v)} = p(v), \qquad P(\text{reject, resample } v) = 0.$

    So $P(\text{output} = v) = p(v).$ ✓

    In both cases the output is distributed exactly as $p$. $\blacksquare$

    This argument is per-position, but it lifts to the whole sequence immediately: each accepted token extends the prefix, and the next verification conditions on that exact prefix, so applying the result position by position shows every emitted token — and therefore the entire generated sequence — is distributed exactly as the target $p$.

    The draft model's quality never appears in the conclusion. A *good* draft is accepted often (big speedup); a *bad* draft is rejected often (small speedup); neither changes the distribution. Contrast this with the lossy alternatives:

  • Quantization — approximates the distribution.
  • Pruning — reduces model capacity.
  • Distillation — trains a smaller, weaker model.
  • Early exit — truncates computation.
  • Speculative decoding is the rare systems-level optimization with an exact correctness guarantee.

    How much do we gain? Expected tokens per step

    Let $\alpha_t$ be the acceptance rate at position $t$:

    $\alphat = \sum{v \in \mathcal{V}} \min\big(pt(v), qt(v)\big) = 1 - \text{TV}(pt, qt),$

    where $\text{TV}(pt, qt) = \tfrac{1}{2}\lVert pt - qt\rVert_1$ is the total-variation distance. This identity is worth pausing on: acceptance rate is one minus the TV distance between draft and target. A draft that matches the target ($q=p$) gives $\alpha = 1$; a draft with disjoint support gives $\alpha = 0$.

    Modeling each position as accepted independently with a constant rate $\alpha$ (an idealization — real acceptances are correlated and $\alpha_t$ varies by position, but it captures the dynamics well), the expected number of *new tokens per step* (accepted tokens plus the bonus) has a clean closed form:

    $\mathbb{E}[\text{tokens/step}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}.$

    A few values make the dynamics concrete:

  • $\alpha = 0.8,\ \gamma = 5$: $\dfrac{1 - 0.8^6}{0.2} = 3.69$ tokens/step.
  • $\alpha = 0.9,\ \gamma = 5$: $\dfrac{1 - 0.9^6}{0.1} = 4.69$ tokens/step.
  • $\alpha = 0.5,\ \gamma = 5$: $\dfrac{1 - 0.5^6}{0.5} = 1.97$ tokens/step.
  • The acceptance rate $\alpha$ is the single most important number governing performance — but, as we will see, it is *necessary but not sufficient*.

    ---

    The metrics that matter

    It is easy to report a flattering number. A method can boast a high acceptance rate yet deliver no wall-clock benefit because its drafting is too expensive. Below are the metrics a complete evaluation needs, and how they relate.

    1. Token acceptance rate ($\alpha$)

    The empirical acceptance rate over a run, pooled over every token that actually received an accept/reject test:

    $\bar\alpha = \frac{\sum{i=1}^{N} (\text{accepted tokens in step } i)}{\sum{i=1}^{N} (\text{verified tokens in step } i)}.$

    A "verified" token is one that was subjected to the accept/reject decision — the accepted tokens plus the single rejected token at the stopping point. The drafted tokens *discarded* after a rejection are not counted: they never reached verification. (Dividing by the full draft block $\gamma$ instead would conflate acceptance rate with draft-budget utilization and systematically *underestimate* $\alpha$ — e.g. reporting $\approx 0.54$ when the true per-token rate is $0.8$.)

    It is the strongest *single* predictor of performance, but it is distribution-, position-, and temperature-dependent: easy tokens (common words, deterministic continuations) accept more readily than hard ones (creative or technical content); early tokens in a response accept less; lower temperature concentrates mass and usually raises $\alpha$.

    2. Expected tokens per step ($\bar k$)

    Including the bonus token,

    $\bar k = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}.$

    This is the amortized number of new tokens per target forward pass. The table below shows why blindly increasing $\gamma$ is a trap:

    | $\alpha$ | $\gamma=3$ | $\gamma=5$ | $\gamma=7$ | $\gamma=10$ | |----------|-----------|-----------|-----------|------------| | 0.5 | 1.88 | 1.97 | 1.99 | 2.00 | | 0.7 | 2.53 | 2.94 | 3.14 | 3.27 | | 0.8 | 2.95 | 3.69 | 4.16 | 4.57 | | 0.9 | 3.44 | 4.69 | 5.70 | 6.86 | | 0.95 | 3.71 | 5.30 | 6.73 | 8.62 |

    For low $\alpha$, increasing $\gamma$ past 3–5 buys almost nothing (every extra draft token needs *all* prior ones accepted, probability $\alpha^k$); for high $\alpha$, longer drafts pay off handsomely.

    3. Wall-clock speedup ($S$) — the metric that actually pays the bills

    Let $cp$ be the cost of one target forward pass, $cq$ the cost of one draft pass. For $N$ tokens, autoregressive decoding costs $N cp$. Speculative decoding runs $N/\bar k$ steps, each costing $c{\text{draft}} + c_{\text{verify}}$:

    $S = \frac{T{\text{AR}}}{T{\text{spec}}} = \frac{\bar k\, cp}{c{\text{draft}} + c{\text{verify}}} \approx \frac{\bar k\, cp}{\gamma\, cq + cp}.$

    Defining the draft overhead ratio $\rho = cq / cp$:

    $\boxed{\,S \approx \frac{\bar k}{1 + \gamma\rho}\,}$

    This formula is the whole story in miniature. The numerator $\bar k$ is what acceptance buys you; the denominator $1 + \gamma\rho$ is what drafting costs you. A draft model with $\alpha = 0.9$ but $\rho = 0.5$ (too big a draft) can easily *lose* to one with $\alpha = 0.75$ and $\rho = 0.02$. A typical well-chosen draft has $\rho < 0.05$ — under 5% of the target's cost.

    One subtlety: $c{\text{verify}} \ne cp$ exactly. Verifying $\gamma$ tokens costs slightly more than generating one because attention runs over a longer sequence and the FFN processes $\gamma$ tokens — but since the FFN is memory-bound at batch size 1, $c{\text{verify}} \approx cp + \epsilon$ for moderate $\gamma$.

    4. Optimal draft length ($\gamma^\star$)

    Substituting $\bar k$ into $S$:

    $S(\gamma) = \frac{1 - \alpha^{\gamma+1}}{(1-\alpha)(1 + \gamma\rho)}.$

    Setting $\partial S/\partial\gamma = 0$ gives an implicit equation with no closed form,

    $-\alpha^{\gamma^\star+1}\ln\alpha\,(1 + \gamma^\star\rho) = \rho\,(1 - \alpha^{\gamma^\star+1}),$

    but the intuition is clean: high $\alpha$, low $\rho$ → longer drafts; low $\alpha$, high $\rho$ → shorter drafts.

    5. Throughput, latency, and burstiness

  • Throughput $= \bar k / (c{\text{draft}} + c{\text{verify}})$ tokens/s.
  • Time to first token (TTFT) is essentially unchanged — it is dominated by the target's prefill.
  • Inter-token latency (ITL) becomes *bursty*: tokens arrive in a clump when a draft is accepted (ITL ≈ 0 within the burst), then pause for the next draft+verify cycle. Average throughput can be high while the *experience* feels uneven — worth keeping in mind for streaming UIs.
  • What to report

    | Metric | Symbol | What it tells you | |--------|--------|-------------------| | Acceptance rate | $\alpha$ | Draft quality | | Tokens per step | $\bar k$ | Amortized generation efficiency | | Wall-clock speedup | $S$ | End-to-end practical benefit | | Optimal draft length | $\gamma^\star$ | Configuration guidance | | Memory overhead | $\Delta M$ | Deployment feasibility | | Throughput / latency | — | System-level performance and UX |

    The single most common evaluation mistake is reporting only $\alpha$ or $\bar k$ and never measuring the clock. A high acceptance rate is necessary but not sufficient. Always close the loop with $S$.

    ---

    We now have the algorithm, the proof, and the yardsticks. What we have described so far is *vanilla* speculative decoding — one small model drafting for one big one. But that is only the first move in a large and inventive design space: Who drafts? How? In a chain or a tree? In A Field Guide to Methods we survey the landmark advances — Medusa, Lookahead, staged cascades, SpecInfer, Sequoia, self-speculation — and the trade-offs that distinguish them.

    ---

    References

    1. Leviathan, Y., Kalman, M., & Matias, Y. (2023). *Fast Inference from Transformers via Speculative Decoding.* ICML. arXiv:2211.17192. 2. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). *Accelerating Large Language Model Decoding with Speculative Sampling.* arXiv:2302.01318. 3. Xia, H., et al. (2024). *Unlocking Efficiency in LLM Inference: A Comprehensive Survey of Speculative Decoding.* ACL Findings. arXiv:2401.07851. 4. Leviathan, Y., et al. — see the original paper's appendix for the formal correctness proof of speculative sampling.