The EAGLE Family: Speculating in Feature Space
*Part 4 of 6 — Speculative Decoding series. Start with The Need for Speed.*
1. The Need for Speed: Why LLMs are memory-bandwidth-bound and how speculation exploits the GPU's idle compute.
2. Speculative Decoding, Formally: The draft-then-verify algorithm, the rejection-sampling proof, and the metrics that matter.
3. A Field Guide to Speculative Decoding Methods: Separate models, Medusa, n-gram, and trees; a taxonomy of every drafting approach.
4. The EAGLE Family (this post): How predicting hidden states instead of tokens reshaped the field, across three generations.
5. Parallel Drafting with Block Diffusion: DFlash collapses γ draft passes into one; DDTree turns that pass into a 7× verification tree.
6. Putting It to Work: Enabling EAGLE, n-gram, and Medusa speculation in vLLM and SGLang, with a guide to measuring real speedup.
---
*Speculative decoding speeds up LLM generation losslessly — a small drafter proposes tokens and the target verifies them in parallel, with a rejection rule that preserves the exact output distribution. The drafters that have emerged range from separate small models to Medusa-style prediction heads, n-gram lookup, and trees of candidate tokens, with the field trending toward drafts that need no separate model and (increasingly) no training.*
No single line of work pushes that arc further than EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency). Across three iterations it went from a learned feature-level drafter, to a smarter dynamic-tree version, to a token-level drafter with multi-layer feature fusion that scales to the state of the art — all on the strength of one deceptively simple bet:
Hidden states are more predictable than tokens.
This post follows that bet through all three versions.
Why predict features instead of tokens?
At each step, the model turns a token embedding into a hidden state $h_t \in \mathbb{R}^d$ through $L$ transformer layers, and the LM head projects it to a vocabulary distribution:
$P(y_{t+1} \mid y_{\le t}) = \text{softmax}(W_{\text{LM}} h_t).$
That final map is deterministic — *all* the uncertainty about the next token lives in $h_t$. Two consequences follow:
1. Feature space is smoother than token space. A small change in $h_t$ yields a small change in the output distribution. But a small change in the *token* (a discrete embedding lookup) can jump to an entirely different hidden state. The continuous signal is the more stable thing to predict.
2. Regression beats classification. Predicting the vector $h_{t+1}$ from $(h_t, e_{t+1})$ is a regression problem, inherently easier than classifying over a 32K+ vocabulary.
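The first consequence is easy to check numerically. The sketch below uses a made-up 3-token vocabulary and a 2-dimensional hidden state (the matrix `W_LM` and both vectors are illustrative values, not taken from any real model) and measures how far the output distribution moves when $h_t$ is nudged slightly.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(W, h):
    """Project hidden state h to vocabulary probabilities; W has one row per token."""
    return softmax([sum(w * x for w, x in zip(row, h)) for row in W])

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Toy values (assumptions for illustration): 3-token vocabulary, d = 2.
W_LM = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h = [0.5, -0.2]
h_nudged = [0.51, -0.19]  # a small perturbation of h

p = lm_head(W_LM, h)
q = lm_head(W_LM, h_nudged)
tv = total_variation(p, q)
print(f"total variation after nudging h: {tv:.4f}")
```

A perturbation of 0.01 per coordinate moves the distribution by well under one percentage point of probability mass here, which is the sense in which the map from feature to distribution is smooth.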
Empirically, these feature trajectories are far more regular than the token stream they induce — smooth enough that a tiny network can extrapolate them. That regularity is the opening EAGLE exploits.
EAGLE-1: feature-level autoregression
EAGLE-1 (Li et al., 2024, *Speculative Sampling Requires Rethinking Feature Uncertainty*) introduces a lightweight *feature predictor* $f_\theta$ that forecasts the next hidden state from the current one and the next token's embedding:
$\hat h_{t+1} = f_\theta(h_t, e_{t+1}).$
There is a subtlety: we do not know $e_{t+1}$ in advance. EAGLE resolves it autoregressively: sample a token $\hat y_{t+1}$ from the distribution induced by $\hat h_t$, look up its embedding, and feed that back in. The predictor itself is a single transformer decoder layer fusing state and embedding:
$f_\theta(h_t, e_{t+1}) = \text{TransformerLayer}\big(\text{Concat}(h_t, e_{t+1})\,W_{\text{fuse}}\big), \quad W_{\text{fuse}} \in \mathbb{R}^{2d \times d}.$
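To make the feedback loop concrete, here is a minimal sketch of the drafting chain with toy dimensions. Several things are assumptions for illustration only: `tanh` stands in for the transformer decoder layer, tokens are picked greedily rather than sampled (so the output is deterministic), and the weights `EMBED`, `W_FUSE`, and `W_LM` are made-up numbers.

```python
import math

def fuse_and_predict(h, e, W_fuse):
    """Toy f_theta: Concat(h, e) @ W_fuse, with tanh standing in for the transformer layer."""
    x = h + e  # list concatenation, length 2d
    d = len(W_fuse[0])
    fused = [sum(x[i] * W_fuse[i][j] for i in range(len(x))) for j in range(d)]
    return [math.tanh(v) for v in fused]

def lm_head_logits(W_lm, h):
    """Logits over the vocabulary for hidden state h; W_lm has one row per token."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_lm]

def draft_chain(h, token, embed, W_fuse, W_lm, steps):
    """Draft `steps` tokens, feeding each predicted feature and chosen token back in."""
    drafted = []
    for _ in range(steps):
        h = fuse_and_predict(h, embed[token], W_fuse)   # predicted next feature
        logits = lm_head_logits(W_lm, h)                # distribution it induces
        token = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick (assumption)
        drafted.append(token)
    return drafted

# Toy weights (assumptions): vocabulary of 3 tokens, d = 2.
EMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_FUSE = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2], [0.4, 0.4]]  # shape 2d x d
W_LM = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

drafted = draft_chain(h=[0.2, -0.1], token=0, embed=EMBED,
                      W_fuse=W_FUSE, W_lm=W_LM, steps=4)
print(drafted)
```

Only the first step sees a feature produced by the target; every later step consumes the drafter's own prediction, so errors can compound along the chain.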
Training is cheap and self-supervised: collect $(h_t, e_{t+1}, h_{t+1})$ tuples from the frozen base model and fit a combined regression-plus-classification objective:
$\mathcal{L} = \mathcal{L}_{\text{reg}} + w_{\text{cls}}\,\mathcal{L}_{\text{cls}}, \qquad \mathcal{L}_{\text{reg}} = \text{SmoothL1}\big(f_\theta(h_t, e_{t+1}),\, \text{sg}(h_{t+1})\big),$
where $\mathcal{L}_{\text{cls}}$ is a cross-entropy term on the token distribution the predicted feature induces. The paper sets $w_{\text{cls}} = 0.1$, since the classification loss is numerically the larger of the two; stop-gradient is applied to the feature targets. A few hours on a single GPU using ShareGPT-style data suffices.
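The objective can be written out for a single training position as follows. This is a sketch under simplifying assumptions: the cross-entropy uses a hard target token rather than a full target distribution, the Smooth L1 threshold `beta=1.0` is a conventional default rather than a value stated here, and there is no autograd, so the stop-gradient is represented simply by treating the target feature as a constant.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 (Huber-style) loss: quadratic near zero, linear beyond beta."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)

def cross_entropy(logits, target_index):
    """Negative log-probability of target_index under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

def eagle1_loss(pred_feature, target_feature, W_lm, target_token, w_cls=0.1):
    """L = L_reg + w_cls * L_cls for one position.

    target_feature plays the role of sg(h_{t+1}): it is a constant here.
    The hard target_token is a simplification (assumption) of the classification term.
    """
    reg = smooth_l1(pred_feature, target_feature)
    logits = [sum(w * x for w, x in zip(row, pred_feature)) for row in W_lm]
    cls = cross_entropy(logits, target_token)
    return reg + w_cls * cls

# Toy call with made-up values: 2-token vocabulary, d = 2.
loss = eagle1_loss([0.1, -0.2], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target_token=0)
print(f"{loss:.4f}")
```

Note that the classification term passes the predicted feature through the LM head, so it penalizes feature errors in proportion to how much they change the resulting token distribution.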
Drafting builds a tree. Rather than a single chain, EAGLE-1 keeps the top-$k$ tokens at each step and grows them into a *fixed-shape* draft tree — the same structure every decoding step (it is this static shape that EAGLE-2 will later make adaptive). The tree is verified in one target pass via tree attention (a node attends only to its ancestors). The longest accepted root-to-leaf path determines the committed tokens.
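The two tree mechanics in that paragraph, the ancestor-only attention mask and the choice of the longest accepted path, can be sketched as below. The flat `parents` array encoding and the per-node `accepted` flags are assumptions made for illustration; in a real system the mask also lets every node attend to the shared prefix, and acceptance comes from the target's verification pass rather than a precomputed list.

```python
def tree_attention_mask(parents):
    """mask[i][j] is True when draft node i may attend to node j (itself or an ancestor).

    parents[i] is the index of node i's parent, or -1 if its parent is the prefix.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask

def longest_accepted_path(parents, tokens, accepted):
    """Return the tokens on the longest root-to-node path whose nodes were all accepted."""
    best = []
    for i in range(len(parents)):
        path, j, ok = [], i, True
        while j != -1:
            if not accepted[j]:
                ok = False
                break
            path.append(tokens[j])
            j = parents[j]
        if ok and len(path) > len(best):
            best = path[::-1]   # collected leaf-to-root, so reverse
    return best

# Toy tree: two first-level candidates; "the" has two children, "a" has one.
parents = [-1, -1, 0, 0, 1]
tokens = ["the", "a", "cat", "dog", "bird"]
accepted = [True, False, False, True, True]  # hypothetical verification outcome

mask = tree_attention_mask(parents)
committed = longest_accepted_path(parents, tokens, accepted)
print(committed)  # ['the', 'dog']
```

The node "bird" is marked accepted but its parent "a" is not, so its path is discarded; only a fully accepted chain from the root can be committed.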
Result: a 2.7–3.5× speedup on LLaMA-2-Chat 70B, lossless — clearly ahead of vanilla SD and Medusa.
EAGLE-2: context-aware dynamic trees
EAGLE-1's tree shape is *static*: the same branching pattern regardless of context. EAGLE-2 (Li et al., 2024) rests its improvement on one empirical fact: the draft model is well-calibrated. Its confidence score, the top-1 probability $c_t = \max_v \hat p(v \mid \text{prefix}_t)$, closely approximates the token's actual acceptance rate. The drafter already "knows" which branches are worth pursuing, so no extra calibration model or retraining is needed.
EAGLE-2 turns this into a two-phase, inference-time tree-construction procedure (the EAGLE-1 drafter is left unchanged):
1. Expand. Grow the tree from the most promising nodes, ranked by *global acceptance probability*: the product of confidence scores along the root-to-node path.
2. Rerank. Across the whole expanded tree, keep the top-$m$ tokens by that same value and flatten them (with a tree-structured attention mask) into the single sequence the target verifies.
Where the draft is confident the tree naturally grows *deeper*; where it is unsure it branches *wider* — all from sorting confidence scores, with no per-step Bellman optimization.
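A compact sketch of expand-then-rerank follows. The drafter is abstracted as a hypothetical callable `draft_probs(path)` returning a token-to-probability dict, the toy probabilities are invented, and the tie-break toward shallower nodes in the rerank step is an assumption of this sketch.

```python
import heapq

def build_dynamic_tree(draft_probs, depth, k, m):
    """Expand the k best nodes per layer, then keep the global top-m by path probability.

    Each node is (value, path): value is the product of drafter confidences
    along the path, and path is the tuple of tokens from the root.
    """
    frontier = [(1.0, ())]   # start from the empty path (the verified prefix)
    all_nodes = []
    for _ in range(depth):
        children = []
        for value, path in frontier:
            probs = draft_probs(path)
            for tok, p in heapq.nlargest(k, probs.items(), key=lambda kv: kv[1]):
                children.append((value * p, path + (tok,)))
        all_nodes.extend(children)
        # Expand: only the k most promising nodes of this layer grow further.
        frontier = heapq.nlargest(k, children, key=lambda node: node[0])
    # Rerank: global top-m by value, preferring shallower nodes on ties.
    return sorted(all_nodes, key=lambda node: (-node[0], len(node[1])))[:m]

def toy_draft_probs(path):
    """Hypothetical drafter: confident about 'the', unsure afterwards."""
    if path == ():
        return {"the": 0.9, "a": 0.1}
    if path == ("the",):
        return {"cat": 0.6, "dog": 0.4}
    return {"sat": 0.5, "ran": 0.5}

kept = build_dynamic_tree(toy_draft_probs, depth=2, k=2, m=3)
kept_paths = [path for _, path in kept]
print(kept_paths)  # [('the',), ('the', 'cat'), ('the', 'dog')]
```

Because a child's value is its parent's value times a probability at most 1, any node that survives the rerank has ancestors that rank at least as high, so the kept set is still a connected tree. Here the low-confidence branch under "a" is dropped entirely and the budget goes to the subtree under "the".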
Result: roughly 4× on LLaMA-2-Chat 70B — about 20–40% faster than EAGLE-1, entirely from smarter draft-tree allocation rather than a better drafter.
EAGLE-3: scaling the drafter with data
EAGLE-3 (Li et al., 2025, *…via Training-Time Test*) starts from a limitation of its predecessors: EAGLE-1/2 pin the drafter to a *feature-regression* objective — predict the target's next hidden state, then map it through the LM head. EAGLE-3's key finding is that this constraint *caps* the drafter: the feature-prediction loss becomes a bottleneck, so feeding it more training data stops helping. Two changes remove the ceiling:
1. Direct token prediction. EAGLE-3 drops the feature-regression loss and trains the draft head to predict the *next token* directly. No longer forced to match a specific hidden vector, the drafter keeps improving as training data grows.
2. Multi-layer feature fusion, trained with a "training-time test". Instead of consuming only the top-layer feature, the drafter fuses *low-, middle-, and high-level* features from the target. At inference the drafter must consume features it generated itself over multiple steps; to avoid a train/inference mismatch, training simulates that multi-step rollout, which is the "training-time test" the title refers to.
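The fusion step can be pictured as concatenating three layer outputs and projecting back to the model width. The sketch below assumes a single linear projection of shape $3d \times d$ and uses an averaging matrix as a stand-in for learned weights; which target layers are tapped and how the projection is parameterized are assumptions here, not details given in this post.

```python
def fuse_features(low, mid, high, W_proj):
    """Concatenate three d-dimensional features and project 3d -> d with W_proj."""
    x = low + mid + high  # list concatenation, length 3d
    d = len(W_proj[0])
    return [sum(x[i] * W_proj[i][j] for i in range(len(x))) for j in range(d)]

def averaging_projection(d):
    """Stand-in for learned weights: output coordinate j averages coordinate j of each input."""
    return [[1.0 / 3.0 if i % d == j else 0.0 for j in range(d)] for i in range(3 * d)]

# Toy features from three depths of a hypothetical target model, d = 2.
low, mid, high = [1.0, 2.0], [4.0, 5.0], [7.0, 8.0]
fused = fuse_features(low, mid, high, averaging_projection(2))
print(fused)
```

The fused vector has the same width as a single hidden state, so the rest of the draft module can consume it in place of the top-layer feature.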
Crucially, EAGLE-3 still trains a draft module and adds parameters; it is not training-free. What it buys is *scalability*: freed from feature regression, more training data yields higher acceptance and therefore larger speedup, whereas EAGLE-1 and EAGLE-2 plateau as data grows. It inherits EAGLE-2's confidence-based dynamic tree for drafting.
Result: up to a 6.5× speedup, roughly 1.4× over EAGLE-2, with the gains continuing to grow as training data and model scale increase.
The three versions side by side
| Aspect | EAGLE-1 | EAGLE-2 | EAGLE-3 |
|--------|---------|---------|---------|
| Training | Required (~hours) | Required (~hours) | Required |
| Extra parameters | Feature predictor | Same predictor as E1 | Draft module |
| Tree structure | Static | Dynamic (confidence) | Dynamic (confidence, from E2) |
| Draft mechanism | Learned feature regression | + dynamic tree (expand & rerank) | Direct token prediction + multi-layer fusion |
| Acceptance rate | ~0.75–0.85 | ~0.78–0.88 | Higher (scales with data) |
| Speedup (70B) | 2.7–3.5× | ~4× | ↑ ~1.4× over E2 (up to 6.5× peak) |
| Deployment | Moderate | Moderate | Easy (released heads) |
The EAGLE design philosophy
Four principles recur across the line:
1. Feature space > token space. Predicting continuous representations is easier and more informative than predicting discrete tokens.
2. Structure matters. Tree topology materially changes performance; static trees leave speed on the table, dynamic trees adapt to varying difficulty.
3. Scale, don't constrain. EAGLE-3's jump came from *removing* the feature-regression objective and predicting tokens directly, so the drafter keeps improving with more data. It did not come from shrinking or eliminating the drafter.
4. Hardware awareness. Every variant produces tree-structured drafts that map onto batched GPU matmuls via tree attention.
For all its sophistication, though, EAGLE shares one limit with every method before it: it drafts *autoregressively* — one feature-prediction step per token, $\gamma$ steps for $\gamma$ tokens. The next leap drops that assumption entirely. In Parallel Drafting with Block Diffusion we turn to block-diffusion drafting: DFlash denoises a whole block of draft tokens in a single parallel pass, and DDTree turns that one pass into a verification tree — together pushing speedups past 6×.
---
References
1. Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). *EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.* ICML. arXiv:2401.15077.
2. Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). *EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees.* EMNLP. arXiv:2406.16858.
3. Li, Y., Wei, F., Zhang, C., Zhang, H., et al. (2025). *EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test.* arXiv:2503.01840.
4. Cai, T., et al. (2024). *Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.* ICML. arXiv:2401.10774. (for the tree-attention comparison)
5. Leviathan, Y., Kalman, M., & Matias, Y. (2023). *Fast Inference from Transformers via Speculative Decoding.* ICML. arXiv:2211.17192.