Parallel Drafting with Block Diffusion: DFlash and DDTree

June 25, 2026

*Part 5 of 6 — Speculative Decoding series. Start with The Need for Speed.*

1. The Need for Speed: Why LLMs are memory-bandwidth-bound and how speculation exploits the GPU's idle compute.
2. Speculative Decoding, Formally: The draft-then-verify algorithm, the rejection-sampling proof, and the metrics that matter.
3. A Field Guide to Speculative Decoding Methods: Separate models, Medusa, n-gram, and trees: a taxonomy of every drafting approach.
4. The EAGLE Family: How predicting hidden states instead of tokens reshaped the field, across three generations.
5. Parallel Drafting with Block Diffusion (this post): DFlash collapses γ draft passes into one; DDTree turns that pass into a 7× verification tree.
6. Putting It to Work: Enabling EAGLE, n-gram, and Medusa speculation in vLLM and SGLang, with a guide to measuring real speedup.

---

*Speculative decoding accelerates LLMs losslessly by drafting tokens cheaply and verifying them in parallel; the strongest drafters today are feature-level predictors like EAGLE, which bolt a tiny head onto the target and forecast its next hidden state.*

But every method so far — separate draft models, Medusa, EAGLE — shares one buried assumption: drafting is autoregressive. To propose $\gamma$ tokens you run the drafter $\gamma$ times, one token per step. EAGLE makes each step cheap; it cannot make the chain short.

Our speedup formula from Speculative Decoding, Formally charges for exactly this. Recall

$S \approx \frac{\bar k}{1 + \gamma\rho},$

where $\rho = c_q/c_p$ is the cost of one draft pass relative to one target pass. The $\gamma$ in the denominator is the price of drafting $\gamma$ tokens *sequentially*: $\gamma$ separate draft passes. The two methods in this post attack that $\gamma$ head-on. DFlash replaces the autoregressive drafter with a block diffusion model that emits a whole block of $\gamma$ tokens in a *single* parallel forward pass. DDTree then observes that this one diffusion pass carries far more information than a single draft sequence (a full distribution at every position) and spends it on a verification tree.

Why autoregressive drafting is the bottleneck

Write the draft latency for $\gamma$ tokens two ways:

$T_{\text{draft}}^{\text{AR}} = \gamma \cdot t_{\text{step}}, \qquad T_{\text{draft}}^{\text{diff}} \approx t_{\text{parallel}}.$

Autoregressive drafting grows *linearly* in $\gamma$. A diffusion drafter that denoises all $\gamma$ positions at once pays a single parallel pass, roughly constant in $\gamma$. Because decoding at batch size 1 is memory-bandwidth-bound on a modern GPU (The Need for Speed), running $\gamma$ positions through one forward pass costs barely more than running one, so $t_{\text{parallel}} \ll \gamma\, t_{\text{step}}$.

This has a second-order payoff. Once drafting is a single pass, you can afford a *deeper, more expressive* drafter — DFlash uses 5–8 transformer layers — without re-incurring the per-token latency that pins autoregressive drafters to one or two layers. In the speedup formula, the denominator collapses from $1 + \gamma\rho$ toward $1 + \rho_{\text{block}}$: the $\gamma$ multiplier on the draft cost disappears.
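To make the collapse of the denominator concrete, here is a minimal sketch that evaluates both forms of the speedup formula. The function names and all numeric values are illustrative assumptions, not measurements from either paper; in particular, `rho_block` is deliberately set higher than `rho` to reflect a deeper drafter.

```python
def speedup_autoregressive(k_bar: float, gamma: int, rho: float) -> float:
    """S ~ k_bar / (1 + gamma * rho): gamma sequential draft passes per step."""
    return k_bar / (1.0 + gamma * rho)


def speedup_block(k_bar: float, rho_block: float) -> float:
    """S ~ k_bar / (1 + rho_block): one parallel draft pass per step."""
    return k_bar / (1.0 + rho_block)


# Illustrative numbers only (assumptions, not results from the papers).
k_bar = 4.0        # mean tokens produced per target pass
gamma = 8          # draft length
rho = 0.05         # one shallow autoregressive draft pass / one target pass
rho_block = 0.15   # one deeper block-diffusion draft pass / one target pass

s_ar = speedup_autoregressive(k_bar, gamma, rho)
s_block = speedup_block(k_bar, rho_block)
print(f"autoregressive drafter: {s_ar:.2f}x")
print(f"block drafter:          {s_block:.2f}x")
```

Even with a draft pass three times as expensive, the block drafter comes out ahead here because its cost is paid once rather than $\gamma$ times. In practice $\bar k$ would also differ between the two drafters; it is held fixed in this sketch to isolate the denominator.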

DFlash: a block-diffusion drafter

DFlash (Chen, Liang & Liu, ICML 2026) makes the drafter a *block diffusion* model — one that fills a whole block of draft tokens in a single parallel pass. The mechanism, built up below.

Block diffusion in one breath

A *block diffusion* language model (Arriola et al., 2025) sits between autoregressive and diffusion models: it splits a sequence into blocks, generates blocks left-to-right, but denoises all tokens within a block in parallel by iteratively unmasking them. Tuning the block size interpolates between pure AR (block size 1) and pure diffusion (one block). For drafting, the relevant property is simple: a block of $\gamma$ masked positions is filled in one forward pass, not $\gamma$.

Drafting by denoising

DFlash uses a lightweight block diffusion model as the drafter. To propose the next $\gamma$ tokens it lays down a block of $\gamma$ masked positions and denoises them simultaneously, yielding a draft block in a single parallel pass — the $T_{\text{draft}}^{\text{diff}}$ above. DFlash uses a block size of 16 (10 for LLaMA-3.1).
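The difference in pass count can be sketched with a toy stand-in for the draft network. Everything here is hypothetical: `ToyDrafter` returns random logits, `MASK` is a made-up mask id, and the point is only to count forward passes, not to model DFlash's actual denoiser.

```python
import random

MASK = -1  # hypothetical mask-token id used only in this sketch


class ToyDrafter:
    """Stand-in for a draft network; returns random logits and counts passes."""

    def __init__(self, vocab_size: int, seed: int = 0):
        self.vocab_size = vocab_size
        self.rng = random.Random(seed)
        self.calls = 0

    def forward(self, tokens: list[int]) -> list[list[float]]:
        """One forward pass: a row of logits for every input position."""
        self.calls += 1
        return [[self.rng.gauss(0.0, 1.0) for _ in range(self.vocab_size)]
                for _ in tokens]


def argmax(row: list[float]) -> int:
    return max(range(len(row)), key=row.__getitem__)


def draft_autoregressive(drafter: ToyDrafter, context: list[int], gamma: int) -> list[int]:
    """Classic drafting: one forward pass per proposed token."""
    tokens = list(context)
    for _ in range(gamma):
        logits = drafter.forward(tokens)
        tokens.append(argmax(logits[-1]))
    return tokens[len(context):]


def draft_block(drafter: ToyDrafter, context: list[int], gamma: int) -> list[int]:
    """Block drafting: append gamma masked slots and fill them in one pass."""
    tokens = list(context) + [MASK] * gamma
    logits = drafter.forward(tokens)
    return [argmax(row) for row in logits[-gamma:]]


ar, block = ToyDrafter(vocab_size=50), ToyDrafter(vocab_size=50)
ar_tokens = draft_autoregressive(ar, [1, 2, 3], gamma=16)
block_tokens = draft_block(block, [1, 2, 3], gamma=16)
print(ar.calls, block.calls)  # 16 1
```

Both drafters propose 16 tokens, but the autoregressive loop needs 16 passes where the block drafter needs one.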

Conditioning on the target: KV injection

A drafter is only useful if it tracks the target. DFlash extracts hidden representations from several intermediate layers of the *frozen* target, concatenates them, and passes them through a lightweight projection into a compact target context feature. Rather than feeding that feature only at the input, DFlash injects it into the key and value projections of every draft-model layer, giving a persistent conditioning signal that runs through the whole draft network. This is what lets the accepted length keep scaling as the draft network gets deeper.
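One plausible reading of KV injection is that the fused target features supply extra key/value rows that every draft layer attends to. The sketch below illustrates that reading only; it is an assumption about the mechanism, not DFlash's implementation. All names (`fuse_target_features`, `draft_layer_attention`), the single-head attention, the shared weights, and the random inputs are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (rows x cols) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0, 1) / math.sqrt(cols) for _ in range(cols)]
            for _ in range(rows)]


def fuse_target_features(layer_states, W_fuse):
    """Concatenate hidden states tapped from several target layers, then project."""
    concat = [v for state in layer_states for v in state]
    return matvec(W_fuse, concat)


def draft_layer_attention(draft_states, context_feats, W_q, W_k, W_v):
    """Single-head attention in which target context rows join the keys and values."""
    sources = context_feats + draft_states  # injected context first, then the block
    K = [matvec(W_k, s) for s in sources]
    V = [matvec(W_v, s) for s in sources]
    d = len(K[0])
    out = []
    for x in draft_states:  # only draft positions issue queries
        q = matvec(W_q, x)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(d)])
    return out


rng = random.Random(0)
d, n_tapped, n_ctx, gamma = 8, 3, 4, 5
W_fuse = rand_matrix(rng, d, d * n_tapped)
W_q, W_k, W_v = (rand_matrix(rng, d, d) for _ in range(3))

# One fused feature per context position, built from n_tapped target layers.
context_feats = [
    fuse_target_features(
        [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_tapped)], W_fuse)
    for _ in range(n_ctx)
]
draft_states = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(gamma)]

# The same context features are re-injected at every draft layer
# (a real model would use separate weights per layer).
for _ in range(2):
    draft_states = draft_layer_attention(draft_states, context_feats, W_q, W_k, W_v)
```

The design point the sketch tries to capture is that the conditioning signal is re-read at each layer instead of being mixed in once at the input, where it could fade as the draft network gets deeper.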

Training and verification

The draft model (a 5-layer transformer, 8 for larger targets) shares the target's frozen token embeddings and LM head, and is trained self-supervised with a few practical tricks: random anchor sampling so training blocks match the inference block structure, exponentially decaying per-position loss weights $w_k = \exp\!\big(-(k-1)/\gamma\big)$ that emphasize the early (more-likely-accepted) positions, and sparse attention masks via Flex Attention. Verification is the standard speculative-decoding step — the target checks the drafted block in one pass and accepts the tokens consistent with its own distribution, resampling at the first mismatch. So DFlash is lossless: the output distribution is exactly the target's.
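The per-position weighting is easy to tabulate. This sketch computes $w_k = \exp(-(k-1)/\gamma)$ for a block and applies it to a list of per-position losses; the normalization by the weight sum and the function names are assumptions made for illustration, since the text above specifies only the weights themselves.

```python
import math


def position_weights(gamma: int) -> list[float]:
    """w_k = exp(-(k - 1) / gamma) for k = 1..gamma."""
    return [math.exp(-(k - 1) / gamma) for k in range(1, gamma + 1)]


def weighted_block_loss(per_position_nll: list[float], gamma: int) -> float:
    """Weighted mean of per-position losses (normalization is an assumption)."""
    w = position_weights(gamma)
    return sum(wi * li for wi, li in zip(w, per_position_nll)) / sum(w)


weights = position_weights(16)
print(f"first weight: {weights[0]:.3f}, last weight: {weights[-1]:.3f}")

# With a uniform per-position loss, the weighted mean equals that loss.
uniform_loss = weighted_block_loss([2.0] * 16, 16)
```

For a block of 16 the first position has weight 1 and the last about 0.39, so late positions still contribute but errors at the front of the block dominate the gradient.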

Results

DFlash reports over 6× lossless acceleration across models and tasks. On Qwen3-8B at temperature 0 it averages 4.86× speedup (6.08× on coding), against 1.76× for EAGLE-3 with a size-16 tree — roughly 2.5–2.75× faster than EAGLE-3. Under realistic batched serving in SGLang (concurrency 1–32) it reaches up to 5.1× while preserving quality.

DDTree: turn the block into a tree

DDTree — short for *Diffusion Draft Tree* (Ringel & Romano, 2026) — builds on DFlash, spending a single diffusion pass on a whole tree of candidates instead of one sequence.

The wasted information

A single block-diffusion pass produces, for *every* future position $i$ in the block, a full marginal distribution $q_i(\cdot)$ — the model's belief about the token at position $i$ without conditioning on earlier choices in the block. Vanilla DFlash collapses all of that into one trajectory (e.g. the per-position argmax) and verifies only that single sequence. The rest of each distribution is thrown away.

Building the tree: a best-first heap

DDTree spends that discarded mass on a draft tree. Each tree node is a candidate prefix $(u_1, \ldots, u_d)$; the goal is to pick, under a node budget $B$, the prefixes most likely to match the target. The key structural fact is that the expected accepted length decomposes into a sum of prefix probabilities over the tree's nodes, so the objective is additive and the best nodes are simply the highest-probability prefixes.
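The additivity follows from linearity of expectation. As a short derivation, under the assumption (made here for illustration) that the drafter's marginals stand in for the target's acceptance probabilities: let $T$ be the set of tree nodes, let $A$ be the number of draft tokens accepted in one step, and let $M_v \in \{0, 1\}$ indicate that the whole prefix at node $v = (u_1, \ldots, u_{d(v)})$ is accepted. The accepted nodes form a single root-to-node path, so counting them gives the accepted length:

$A = \sum_{v \in T} M_v \quad\Longrightarrow\quad \mathbb{E}[A] = \sum_{v \in T} \Pr[M_v = 1] \approx \sum_{v \in T} \prod_{i=1}^{d(v)} q_i(u_i).$

Each node contributes its own prefix probability independently of which other nodes were chosen. Since a prefix is never less probable than any of its extensions, the $B$ most probable prefixes also form a valid tree, apart from ties.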

Because the marginals are independent across positions, a prefix's score is

$\sigma(\rho) = \sum_{i=1}^{d} \log q_i^{(\rho_i)},$

where $\rho = (\rho_1, \ldots, \rho_d)$ records the *rank* chosen at each position ($\rho_i = 1$ is that position's top token) and $q_i^{(r)}$ is the probability of the rank-$r$ token at position $i$. (This $\rho$ is a rank tuple, unrelated to the cost ratio $\rho$ in the speedup formula.) DDTree enumerates the top-$B$ prefixes without touching the exponential space, using a best-first max-heap (Algorithm 1 in the paper): start from $(1)$; repeatedly pop the highest-scoring tuple and push two successors, a *sibling* that increments the last rank and an *extension* that appends rank 1 at depth $d+1$; stop after $B$ pops. Each pop pushes at most two entries, so the heap holds $O(B)$ items and the total cost is $O(B \log B)$. Reported sweet spot: $B \approx 256$–$512$ before verifier overhead dominates.
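The enumeration can be sketched with Python's `heapq`. This is a minimal illustration of the sibling/extension scheme described above, not a transcription of the paper's Algorithm 1; `top_prefixes` and `log_q` are hypothetical names, the probabilities are made up, and each position's log-probabilities are assumed to be pre-sorted in descending order.

```python
import heapq
import math


def top_prefixes(log_q: list[list[float]], budget: int):
    """Return up to `budget` rank tuples in descending score order.

    log_q[i][r - 1] is the log-probability of the rank-r token at
    position i + 1, with ranks sorted from most to least likely.
    """
    max_depth = len(log_q)
    # heapq is a min-heap, so scores are stored negated.
    heap = [(-log_q[0][0], (1,))]
    selected = []
    while heap and len(selected) < budget:
        neg_score, ranks = heapq.heappop(heap)
        score = -neg_score
        selected.append((ranks, score))
        depth, last = len(ranks), ranks[-1]
        # Sibling: same depth, next-best token at the last position.
        if last < len(log_q[depth - 1]):
            sibling_score = score - log_q[depth - 1][last - 1] + log_q[depth - 1][last]
            heapq.heappush(heap, (-sibling_score, ranks[:-1] + (last + 1,)))
        # Extension: append the top token of the next position.
        if depth < max_depth:
            heapq.heappush(heap, (-(score + log_q[depth][0]), ranks + (1,)))
    return selected


# Made-up marginals for a block of three positions.
probs = [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05]]
log_q = [[math.log(p) for p in row] for row in probs]

tree = top_prefixes(log_q, budget=5)
print([ranks for ranks, _ in tree])
# [(1,), (1, 1), (1, 1, 1), (2,), (2, 1)]

expected_len = sum(math.exp(score) for _, score in tree)
```

Every rank tuple has exactly one predecessor (its previous sibling, or its parent when the last rank is 1), so nothing is pushed twice, and a successor never scores higher than the tuple that produced it, which is what makes the pop order correct. In this toy example the five-node tree has an estimated accepted length of about 1.91, against about 1.40 for the three-node greedy chain alone.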

Verification

The selected prefixes are flattened into one sequence rooted at the bonus token and verified in a single target pass with ancestor-only tree attention: each node attends to the past context (via the KV cache), the root/bonus token, and its own ancestors, with position IDs assigned by tree depth. The verifier then walks the tree — at each level, if the target's chosen token matches a child node, that node is accepted (reusing the logits already computed in the single pass) and the walk descends; at the first mismatch it stops, and the unmatched target token becomes the next bonus token. As with DFlash, verification is exact, so DDTree is lossless.
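The mask and the walk can both be sketched from a flat parent array. This is an illustrative sketch under stated assumptions: node 0 is the root/bonus token, `parents`, `tokens`, and `target_choice` are hypothetical names, attention to the cached past context is not modeled, and `target_choice[i]` stands for whichever token the target selects after node `i` (the argmax at temperature 0).

```python
def ancestor_mask(parents: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when node i may attend to node j (itself or an ancestor)."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def walk_tree(parents: list[int], tokens: list[int], target_choice: list[int]):
    """Descend from the root while the target's choice matches a child node.

    Returns the accepted draft tokens and the next bonus token.
    """
    children: dict[int, dict[int, int]] = {}
    for i, p in enumerate(parents):
        if p != -1:
            children.setdefault(p, {})[tokens[i]] = i
    node, accepted = 0, []
    while True:
        wanted = target_choice[node]
        nxt = children.get(node, {}).get(wanted)
        if nxt is None:
            return accepted, wanted  # first mismatch: wanted becomes the bonus token
        accepted.append(tokens[nxt])
        node = nxt


# Toy tree: root(7) -> 11 -> 12, and root(7) -> 21 -> 22.
parents = [-1, 0, 1, 0, 3]
tokens = [7, 11, 12, 21, 22]
target_choice = [21, 99, 99, 22, 5]  # made-up target picks, one per node

mask = ancestor_mask(parents)
accepted, bonus = walk_tree(parents, tokens, target_choice)
print(accepted, bonus)  # [21, 22] 5
```

Because every node's logits come out of the same masked pass, the walk is pure bookkeeping: it reads `target_choice` entries that already exist and never calls the target again.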

Results

Feeding the same DFlash drafter through a tree lifts every number. On Qwen3-8B at temperature 0:

| Dataset | DFlash speedup | DDTree speedup | DDTree acceptance length $\tau$ |
|---------|----------------|----------------|---------------------------------|
| MATH-500 | 5.56× | 7.52× | 10.73 |
| HumanEval | 4.84× | 6.90× | 9.67 |
| GSM8K | 4.78× | 6.75× | 9.54 |

On these three datasets that is a gain of roughly 35–43% over vanilla DFlash (for example, 7.52/5.56 ≈ 1.35 on MATH-500), purely from using the per-position distributions a single diffusion pass already produced. DDTree is evaluated across Qwen3-4B/8B and Qwen3-Coder-30B on ten reasoning, code, and general benchmarks at temperatures 0 and 1.

The shape of the idea

The two methods compose into one clean story:

DFlash kills the $\gamma$ in the draft cost — block diffusion drafts a whole block in a single parallel pass, so $1 + \gamma\rho$ collapses toward $1 + \rho_{\text{block}}$ and a deeper drafter becomes affordable.

DDTree raises $\bar k$ — the same single diffusion pass already encodes a distribution at every position, so building a tree from those marginals (rather than one trajectory) buys more accepted tokens per step for almost no extra draft cost.

It is the same wide-tree intuition that Sequoia and EAGLE-2 (A Field Guide to Methods and The EAGLE Family) arrived at — but the tree is now fed by a *parallel* drafter's marginals, and the selection is an exact best-first enumeration rather than a learned or heuristic expansion. The broader lesson is that drafting need not be autoregressive at all: the moment you can produce many positions' distributions in one pass, both halves of the speedup formula — shorter draft latency and higher acceptance — improve together.

Looking ahead

Block diffusion is one of several directions the frontier is exploring as speculative decoding reaches beyond a single small drafter — toward reasoning models, non-Transformer backbones, retrieval, and multimodality. But however the draft is produced, the payoff only matters if you can turn it on in a real serving stack. In Putting It to Work we land the whole series in running code — enabling speculative decoding in vLLM and SGLang, and measuring whether the speedup is real on your workload.

---

References

1. Chen, J., Liang, Y., & Liu, Z. (2026). *DFlash: Block Diffusion for Flash Speculative Decoding.* ICML. arXiv:2602.06036.
2. Ringel, L., & Romano, Y. (2026). *Accelerating Speculative Decoding with Block Diffusion Draft Trees (DDTree).* arXiv:2604.12989.
3. Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., & Kuleshov, V. (2025). *Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models.* ICLR (Oral). arXiv:2503.09573.
4. Li, Y., Wei, F., Zhang, C., Zhang, H., et al. (2025). *EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test.* arXiv:2503.01840.
5. Leviathan, Y., Kalman, M., & Matias, Y. (2023). *Fast Inference from Transformers via Speculative Decoding.* ICML. arXiv:2211.17192.