A Field Guide to Speculative Decoding Methods

June 25, 2026

*Part 3 of 6 — Speculative Decoding series. Start with The Need for Speed.*

1. The Need for Speed — Why LLMs are memory-bandwidth-bound and how speculation exploits the GPU's idle compute.
2. Speculative Decoding, Formally — The draft-then-verify algorithm, the rejection-sampling proof, and the metrics that matter.
3. A Field Guide to Speculative Decoding Methods (this post) — Separate models, Medusa, n-gram, and trees: a taxonomy of every drafting approach.
4. The EAGLE Family — How predicting hidden states instead of tokens reshaped the field, across three generations.
5. Parallel Drafting with Block Diffusion — DFlash collapses γ draft passes into one; DDTree turns that pass into a 7× verification tree.
6. Putting It to Work — Enabling EAGLE, n-gram, and Medusa speculation in vLLM and SGLang, with a guide to measuring real speedup.

---

*Speculative decoding accelerates an LLM losslessly — a small, cheap draft model proposes several tokens, the large target model verifies them all in a single parallel forward pass, and a rejection-sampling rule guarantees the output is distributed exactly as the target's. Its payoff is captured by the speedup formula $S \approx \bar k / (1 + \gamma\rho)$: accepted tokens per step ($\bar k$) over the relative cost of drafting ($\gamma\rho$).*

Every advance in the field can be read as an attempt to push one of two levers: raise $\bar k$ (better, longer-reaching drafts) or lower the cost $\gamma\rho$ (cheaper drafting, cheaper verification).

This post is the landmark tour. Since the original 2023 papers, researchers have explored a rich design space along four axes: *who drafts*, *how they draft*, *what structure the draft takes*, and *how verification proceeds*. Here are the major families.

1. Vanilla speculative decoding

The original formulation, introduced independently by Leviathan et al. (2023) and Chen et al. (2023), uses a smaller model from the same family as the draft.

Target $M_p$: e.g. LLaMA-70B. Draft $M_q$: e.g. LLaMA-7B (same tokenizer, independently trained). Draft length $\gamma$: typically 3–5.

Draft $\gamma$ tokens autoregressively, run $M_p$ on prefix + draft in one pass, apply the rejection-sampling scheme from the previous post.
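The verification step can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production implementation: distributions are plain Python lists over a tiny vocabulary, and the helper names `sample` and `verify` are hypothetical. It applies the standard rule from the original papers: accept a drafted token $x$ with probability $\min(1, p(x)/q(x))$, and on the first rejection resample from the renormalized residual $\max(0, p - q)$.

```python
import random

def sample(dist, rng):
    """Draw an index from a probability list."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def verify(draft_tokens, q_dists, p_dists, rng):
    """Return the tokens emitted by one draft-then-verify step.

    q_dists[i] and p_dists[i] are the draft and target distributions at
    draft position i; p_dists has one extra entry for the bonus position
    that follows the last drafted token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)  # accepted: keep the drafted token
            continue
        # rejected: resample from the residual max(0, p - q), renormalized
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        total = sum(residual)
        out.append(sample([r / total for r in residual], rng))
        return out  # everything after the first rejection is discarded
    # all drafts accepted: the target's extra distribution yields a bonus token
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out
```

When the draft and target distributions coincide, every drafted token is accepted and the step emits $\gamma + 1$ tokens; when the draft puts all its mass where the target has none, the first token is rejected and replaced by a sample from the target's own support.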

Speedup. With $\alpha \approx 0.7$, $\gamma = 5$: $\bar k = (1 - 0.7^6)/(0.3) \approx 2.94$. If the draft costs 5% of the target ($\rho = 0.05$): $S = 2.94 / (1 + 5\times0.05) \approx 2.35\times$.
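The arithmetic above is easy to reproduce and to sweep over other settings. The sketch below assumes the i.i.d.-acceptance model used in this series (a constant per-token acceptance rate $\alpha$); the function names are illustrative only.

```python
def expected_accepted(alpha: float, gamma: int) -> float:
    """Expected tokens per verify step: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return gamma + 1.0  # limit of the geometric sum when every draft is accepted
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def speedup(alpha: float, gamma: int, rho: float) -> float:
    """S ~= k_bar / (1 + gamma * rho), where rho is the draft/target cost ratio."""
    return expected_accepted(alpha, gamma) / (1.0 + gamma * rho)

if __name__ == "__main__":
    # Sweep the draft length to see where longer drafts stop paying off.
    for gamma in range(1, 9):
        print(gamma, round(expected_accepted(0.7, gamma), 2),
              round(speedup(0.7, gamma, 0.05), 2))
```

With $\alpha = 0.7$, $\gamma = 5$, $\rho = 0.05$ this reproduces $\bar k \approx 2.94$ and $S \approx 2.35$.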

Strengths: simple, lossless, no training (reuse an existing small model). Weaknesses: needs a compatible draft model in the same family; quality is capped by the small model's capability gap.

2. Medusa: parallel draft heads

Medusa (Cai et al., 2024) eliminates the separate draft model. Instead it bolts $K$ lightweight *prediction heads* onto the target, where head $k$ predicts the token $k{+}1$ positions ahead — i.e. $y_{t+k+1}$ — from the *same* hidden state $h_t$ (the base LM head already supplies the immediate next token $y_{t+1}$):

$\text{Head}_k(h_t) = \text{softmax}\!\Big(W_k^{(2)}\big(\text{SiLU}(W_k^{(1)} h_t + b_k) + h_t\big)\Big),$

a single SiLU block with a residual skip *in feature space*, where $W_k^{(2)}$ is initialized to the original LM head (and $W_k^{(1)}$ to zero) so each head starts out reproducing the base model's prediction. Only the heads are trained (base frozen), minimizing $\mathcal{L}_k = -\sum_t \log \text{Head}_k(h_t)[y_{t+k+1}]$.
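A dependency-free sketch of one head makes the initialization argument concrete. It assumes tiny list-based matrices and a zero bias at initialization (the bias value is an assumption; the text only specifies the two weight matrices). The helper names are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def silu(v):
    """SiLU(x) = x * sigmoid(x), applied elementwise."""
    return [x / (1.0 + math.exp(-x)) for x in v]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def medusa_head(h, W1, b, W2):
    """softmax(W2 (SiLU(W1 h + b) + h)): one residual block, then a vocab projection."""
    hidden = [a + bi for a, bi in zip(matvec(W1, h), b)]
    resid = [s + hi for s, hi in zip(silu(hidden), h)]
    return softmax(matvec(W2, resid))

# Toy sizes: hidden width 3, vocabulary 4.
h = [0.5, -1.0, 2.0]
lm_head = [[0.1, 0.2, 0.3], [0.0, -0.5, 0.4], [1.0, 0.0, -0.2], [0.3, 0.3, 0.3]]
W1_init = [[0.0] * 3 for _ in range(3)]  # zero init: the SiLU branch outputs 0
b_init = [0.0] * 3                       # assumed zero bias at init
head_probs = medusa_head(h, W1_init, b_init, lm_head)
base_probs = softmax(matvec(lm_head, h))
```

Because $\text{SiLU}(0) = 0$, the residual path passes $h_t$ through unchanged at initialization, so `head_probs` equals `base_probs` before any training.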

Crucially, the heads do not propose a single chain but a tree of candidates: if head $k$ keeps its top-$s_k$ tokens, the tree has $\prod_k s_k$ leaves, pruned to the most probable branches. The tree is verified in *one* pass via tree attention — a mask where a node attends only to its ancestors:

$M_{ij} = \begin{cases} 0 & \text{if } j \text{ is an ancestor of } i \text{ (or } j=i)\\ -\infty & \text{otherwise.}\end{cases}$
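The mask can be built directly from a parent-pointer encoding of the tree. The sketch below assumes that encoding (`parent[i]` is the index of node `i`'s parent, `-1` for the root); real implementations typically build the same mask as a tensor on the GPU.

```python
NEG_INF = float("-inf")

def tree_attention_mask(parent):
    """Return M with M[i][j] = 0 if j is i or an ancestor of i, else -inf.

    parent[i] is the index of node i's parent, or -1 for a root.
    """
    n = len(parent)
    mask = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, unmasking each ancestor
            mask[i][j] = 0.0
            j = parent[j]
    return mask

# Example tree: node 0 is the root; 1 and 2 are its children;
# 3 and 4 hang under 1; 5 hangs under 2.
mask = tree_attention_mask([-1, 0, 0, 1, 1, 5 - 3])
```

In this example node 3 attends only to nodes 0, 1, and 3, so sibling branches never see each other and every root-to-leaf path is scored as if it were the only continuation.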

Plus: no separate draft weights, fully parallel drafting (all heads read one hidden state), trainable in a few hours. Minus: requires training; head accuracy decays for larger $k$; tree size grows combinatorially. (Medusa is only *approximately* lossless under the typical-acceptance rule it uses for sampling, which accepts candidates by a probability threshold rather than by exact rejection sampling.)

3. Lookahead decoding

Lookahead decoding (Fu et al., 2024) takes a radically different tack: it frames autoregressive generation as solving a fixed-point system by Jacobi iteration. View $\gamma$ future tokens as unknowns $y_t = f(y_1, \ldots, y_{t-1})$, initialize them randomly, and update all positions in parallel:

$y_t^{(s+1)} = f\big(y_1^{(s)}, \ldots, y_{t-1}^{(s)}\big) \quad \text{for all } t \text{ simultaneously.}$

A position that stops changing ($y_t^{(s+1)} = y_t^{(s)}$) has converged; several can converge at once. The iterations are not wasted: they deposit *n-gram trajectories* into a growing pool, which then serve as candidate continuations verified in parallel via tree attention.
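The Jacobi update itself is simple to demonstrate on a toy problem. In the sketch below, `next_token` is a hypothetical deterministic stand-in for greedy decoding with a real model, and the n-gram pool is omitted; the point is only that the parallel iteration reaches the same fixed point as sequential decoding.

```python
def next_token(prefix):
    """Hypothetical deterministic stand-in for greedy decoding with a real model."""
    return (3 * prefix[-1] + len(prefix)) % 11

def autoregressive(prompt, gamma):
    """Reference: decode gamma tokens one at a time."""
    seq = list(prompt)
    for _ in range(gamma):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, gamma, init=0):
    """Iterate all gamma positions in parallel until nothing changes."""
    guess = [init] * gamma
    steps = 0
    while True:
        steps += 1
        # Each position is recomputed from the previous iterate only, so the
        # gamma calls are independent and could share one batched forward pass.
        new = [next_token(list(prompt) + guess[:t]) for t in range(gamma)]
        if new == guess:
            return guess, steps
        guess = new

tokens, steps = jacobi_decode([4], gamma=6)
```

After iteration $s$ at least the first $s$ positions are correct, so the loop needs at most $\gamma + 1$ iterations (the last one only confirms the fixed point); lucky guesses let several positions settle in a single iteration.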

Plus: training-free, model-agnostic, no draft model, and it improves as the n-gram pool grows. Minus: higher per-step overhead; acceptance depends heavily on how repetitive/predictable the text is; pool management adds complexity.

4. Staged speculative decoding

Staged speculative decoding (Spector & Ré, 2023) uses a *cascade* of progressively larger models — a tiny model drafts for a small model, which drafts for the target. Early stages are nearly free (a tiny model fits in L2 cache), and because each stage boundary applies the exact rejection-sampling rule, the composition of exact samplers is still exact. The speedup compounds across stages.

5. SpecInfer: tree-structured parallel decoding

In SpecInfer (Miao et al., 2024), multiple *Small Speculative Models* (SSMs) independently propose sequences, which are merged by common prefix into a single tree. For instance, if SSM1 proposes A→B→C and A→B→D while SSM2 proposes A→B→C and A→E→F, the shared prefixes collapse into one tree:

```mermaid
graph TD
    A((A)) --> B((B))
    A --> E((E))
    B --> C((C))
    B --> D((D))
    E --> F((F))
```

Figure: four candidate sequences from two SSMs, merged by shared prefix into a single draft tree the target verifies in one pass.

The target verifies all paths at once with *topology-aware causal attention* (each node attends only to its ancestors). Plus: diverse candidates from several SSMs widen the explored token space. Minus: memory cost of multiple draft models; tree-size management.
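The prefix merge in the example can be sketched with a nested-dict trie. This is an illustrative data structure, not SpecInfer's actual implementation, and the function names are hypothetical.

```python
def merge_by_prefix(sequences):
    """Merge token sequences into a trie of nested dicts keyed by token."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})  # reuse the branch if it already exists
    return root

def count_nodes(tree):
    """Number of tokens the target must score after merging."""
    return sum(1 + count_nodes(child) for child in tree.values())

def paths(tree, prefix=()):
    """Yield every root-to-leaf path in the merged tree."""
    for tok, child in tree.items():
        here = prefix + (tok,)
        if child:
            yield from paths(child, here)
        else:
            yield here

# The four proposals from the example: two from SSM1, two from SSM2.
proposals = ["ABC", "ABD", "ABC", "AEF"]
tree = merge_by_prefix(proposals)
```

The four proposals contain 12 tokens in total, but the merged tree has only 6 nodes and 3 distinct root-to-leaf paths, which is what the target scores in its single verification pass.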

6. Sequoia: optimal trees by dynamic programming

Sequoia (Chen et al., 2024) asks the structural question directly: *what tree maximizes expected accepted tokens for a fixed node budget?* Let $T(m, d)$ be the optimal expected tokens for a subtree of $m$ nodes at depth $d$:

$T(m, d) = \max_{(m_1, \ldots, m_c):\,\sum_i m_i = m-1} \Big[\,1 + \sum_{i=1}^{c} p_i \cdot T(m_i, d+1)\,\Big],$

where $p_i$ is the acceptance probability of the $i$-th child. The root contributes 1, and the children contribute additively — every branch of the tree accumulates acceptance mass, not just the single best one. The optimal solution is wider at the top (hedge against early rejection) and narrower at depth (deep nodes have lower cumulative acceptance $\alpha^d$). Sequoia also picks the hardware-optimal *total* tree size $m^\star$ as a function of bandwidth, compute throughput, and sequence length — an early example of sizing a draft tree to the hardware, a concern that returns in Parallel Drafting with Block Diffusion when draft-tree budgets meet verifier overhead.
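A simplified version of the recurrence fits in a short memoized function. The sketch assumes a depth-independent acceptance vector (so the depth argument $d$ drops out), treats the budget as "at most $m$ nodes", and omits the hardware-aware choice of $m^\star$; it illustrates the dynamic program, not Sequoia's full algorithm.

```python
from functools import lru_cache

def best_tree_value(budget, accept):
    """Expected accepted tokens of the best tree with at most `budget` nodes.

    accept[i] is the (assumed depth-independent) probability that the
    i-th ranked child of any node is accepted.
    """
    accept = tuple(accept)

    @lru_cache(maxsize=None)
    def T(m):
        # The root counts as 1 token; the other m - 1 nodes go to its children.
        return 0.0 if m == 0 else 1.0 + split(m - 1, 0)

    @lru_cache(maxsize=None)
    def split(rest, i):
        # Best way to spend `rest` nodes on children i, i+1, ...
        if rest == 0 or i == len(accept):
            return 0.0
        return max(accept[i] * T(k) + split(rest - k, i + 1)
                   for k in range(1, rest + 1))

    return T(budget)

value = best_tree_value(3, (0.6, 0.2))
```

With acceptance probabilities $(0.6, 0.2)$ and three nodes, the chain scores $1 + 0.6 + 0.36 = 1.96$ while the two-child fork scores $1 + 0.6 + 0.2 = 1.8$, so the program picks the chain; changing the acceptance vector or the budget shifts that balance between depth and width.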

7. Online speculative decoding

Standard SD fixes the draft model. Online speculative decoding (Liu et al., 2024) *adapts* it during inference, exploiting a free supervision signal: verification already computes both $p(v)$ and $q(v)$ for every token. After each step it nudges the draft toward the target,

$\theta_q \leftarrow \theta_q - \eta\,\nabla_{\theta_q}\,\text{KL}\big(p \,\Vert\, q_{\theta_q}\big),$

using a distillation buffer of recent (context, target-distribution) pairs. Plus: draft quality rises over a session and specializes to the live distribution. Minus: training-in-the-loop overhead; needs careful tuning of update frequency.
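For intuition, the update can be worked out in the simplest possible setting: a single context whose draft distribution is a softmax over a free logit vector. That setting is an assumption made for illustration (a real draft model has shared weights and a buffer of many contexts), but it shows the key fact that the gradient of $\text{KL}(p \,\Vert\, q)$ with respect to the draft logits is just $q - p$.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def kl(p, q):
    """KL(p || q) for probability lists, skipping zero-mass entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(logits, p, lr):
    """One gradient step on KL(p || softmax(logits)) with respect to the logits."""
    q = softmax(logits)
    # d KL / d logits_j = q_j - p_j
    return [z - lr * (qi - pi) for z, qi, pi in zip(logits, q, p)]

target = [0.7, 0.2, 0.1]   # target distribution observed during verification
logits = [0.0, 0.0, 0.0]   # draft starts out uniform
kl_before = kl(target, softmax(logits))
for _ in range(50):
    logits = distill_step(logits, target, lr=0.5)
kl_after = kl(target, softmax(logits))
```

Each step moves draft probability mass toward tokens the target favors, so `kl_after` ends up below `kl_before`; a lower divergence is what raises the acceptance rate over a session.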

8. Self-speculative decoding

Draft & Verify (Zhang et al., 2023) and LayerSkip (Elhoushi et al., 2024) both let the target model draft *for itself* by skipping layers. LayerSkip runs the first $L'$ of $L$ layers and applies the LM head to that intermediate state as the draft; verification runs the full stack and reuses the draft's KV cache for the first $L'$ layers. (The related Draft & Verify instead drafts through a sparse, automatically selected subset of layers and uses a confidence threshold to decide *when to stop drafting* — an adaptive draft length — rather than a fixed exit layer.) Plus: zero extra parameters, no separate KV cache. Minus: early-layer predictions are often weak; needs trained early-exit heads or a confidence predictor. (It remains *exactly* lossless — verification still applies the standard rejection-sampling rule, so skipping layers only affects speed, never the output.)

The comparison matrix

| Method | Draft source | Training | Memory overhead | Tree structure | Lossless |
|--------|--------------|----------|-----------------|----------------|----------|
| Vanilla SD | Separate small model | None | Full draft model | Flat | Yes |
| Medusa | Prediction heads | Heads | Small (heads only) | Tree | Approx. |
| Lookahead | Self (Jacobi) | None | N-gram pool | Tree | Yes (greedy) |
| Staged SD | Cascade | None | Multiple models | Flat | Yes |
| SpecInfer | Multiple SSMs | None | Multiple SSMs | Tree | Yes |
| Sequoia | Any | Topology opt. | Depends on draft | Optimal tree | Yes |
| Online SD | Adapted draft | Online distill. | Draft model | Flat | Yes |
| Self-SD | Self (early exit) | Optional | None | Flat | Yes |
| EAGLE | Feature predictor | Predictor | Small predictor | Tree (dynamic in 2+) | Yes |

Read down the columns and a pattern emerges. The field is steadily moving toward drafts that need no separate model, trees instead of chains, and training-free deployment: three currents that converge in the most influential line of work to date. That work has its own arc worth telling in full. The next post, The EAGLE Family, follows it from its central bet, that *hidden states are more predictable than tokens*, to the way it reshaped how the field thinks about drafting.

---

References

1. Leviathan, Y., Kalman, M., & Matias, Y. (2023). *Fast Inference from Transformers via Speculative Decoding.* ICML. arXiv:2211.17192.
2. Chen, C., et al. (2023). *Accelerating Large Language Model Decoding with Speculative Sampling.* arXiv:2302.01318.
3. Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., & Dao, T. (2024). *Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.* ICML. arXiv:2401.10774.
4. Fu, Y., Bailis, P., Stoica, I., & Zhang, H. (2024). *Break the Sequential Dependency of LLM Inference Using Lookahead Decoding.* arXiv:2402.02057.
5. Spector, B., & Ré, C. (2023). *Accelerating LLM Inference with Staged Speculative Decoding.* arXiv:2308.04623.
6. Miao, X., et al. (2024). *SpecInfer: Accelerating Generative LLM Serving with Tree-based Speculative Inference and Verification.* ASPLOS. arXiv:2305.09781.
7. Chen, Z., May, A., Svirschevski, R., Huang, Y., Ryabinin, M., Jia, Z., & Chen, B. (2024). *Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding.* arXiv:2402.12374.
8. Liu, X., et al. (2024). *Online Speculative Decoding.* ICML. arXiv:2310.07177.
9. Zhang, J., et al. (2023). *Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding.* arXiv:2309.08168.
10. Elhoushi, M., et al. (2024). *LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding.* arXiv:2404.16710.