Putting It to Work: Serving Speculative Decoding with vLLM and SGLang

June 25, 2026

*Part 6 of 6 — Speculative Decoding series. Start with The Need for Speed.*

1. The Need for Speed: Why LLMs are memory-bandwidth-bound and how speculation exploits the GPU's idle compute.
2. Speculative Decoding, Formally: The draft-then-verify algorithm, the rejection-sampling proof, and the metrics that matter.
3. A Field Guide to Speculative Decoding Methods: Separate models, Medusa, n-gram, and trees: a taxonomy of every drafting approach.
4. The EAGLE Family: How predicting hidden states instead of tokens reshaped the field, across three generations.
5. Parallel Drafting with Block Diffusion: DFlash collapses γ draft passes into one; DDTree turns that pass into a 7× verification tree.
6. Putting It to Work (this post): Enabling EAGLE, n-gram, and Medusa speculation in vLLM and SGLang, with a guide to measuring real speedup.

---

*Speculative decoding makes an LLM generate faster while provably leaving its output unchanged — a cheap drafter proposes several tokens and the target verifies them in one parallel pass (lossless, via rejection sampling), cashing in the fact that decoding is memory-bandwidth-bound and the GPU is otherwise idle. The drafters have grown steadily more capable — separate small models, Medusa heads, feature-level EAGLE predictors, and parallel block-diffusion drafters — all to raise the accepted-tokens-per-step that drives the speedup.*

This final post is operational. The good news is that you almost never implement speculative decoding by hand anymore — the two dominant open-source inference engines, vLLM and SGLang, ship it as a configuration flag. Below is how to turn it on, how to choose a method, and how to confirm the speedup is real on *your* workload.

Version caveat. Speculative-decoding APIs in both engines have moved fast. The snippets below reflect the stable interfaces as of early 2026 (vLLM's speculative_config on the V1 engine; SGLang's --speculative-* flags). Always cross-check against the docs for your installed version — argument names and supported methods change between releases.

---

A decision guide: which method?

Before any code, pick a method to match your situation. The trade-offs are exactly the ones captured by the speedup formula $S \approx \bar k / (1 + \gamma\rho)$ from Speculative Decoding, Formally and by the taxonomy in A Field Guide to Speculative Decoding Methods:

| If you… | Use | Why |
|---------|-----|-----|
| Have a small model in the same family | Draft model (vanilla SD) | Zero training, lossless, simple |
| Want the best general speedup, can grab a checkpoint | EAGLE-3 | Top acceptance, plug-and-play, only a lightweight draft head to host (no separate full model) |
| Generate repetitive/structured output (code, SQL, JSON, agents) | N-gram / prompt-lookup (or SuffixDecoding-style) | Free, no model, huge wins on repetition |
| Are latency-bound at low batch size / low load | Any of the above | SD shines when the GPU is idle |
| Are throughput-bound at high batch size | *Reconsider* | At large batches the GPU is already compute-bound; SD's extra verify work can *hurt* |

That last row is the single most important operational caveat, and it follows directly from The Need for Speed: speculation pays for the GPU's *idle* capacity. Under heavy batched load the idle capacity is gone, so the wasted-draft overhead can dominate. Speculation is primarily a latency optimization for low-to-moderate load, and most engines let you disable it dynamically as load rises.
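The speedup formula quoted above is easy to explore numerically before touching an engine. The sketch below is a back-of-the-envelope aid, not a benchmark: the function names and the three operating points are made up for illustration, and it assumes $\bar k$ is the mean number of tokens produced per verification step, $\gamma$ the draft length, and $\rho$ the cost of one draft pass relative to one target pass (check Speculative Decoding, Formally for the exact definitions).

```python
def estimated_speedup(mean_tokens_per_step: float, gamma: int, rho: float) -> float:
    """Evaluate S ≈ k̄ / (1 + γρ) for one operating point."""
    return mean_tokens_per_step / (1.0 + gamma * rho)


def breaks_even(mean_tokens_per_step: float, gamma: int, rho: float) -> bool:
    """Speculation only helps when S > 1, i.e. when k̄ > 1 + γρ."""
    return estimated_speedup(mean_tokens_per_step, gamma, rho) > 1.0


# Hypothetical operating points (illustrative numbers, not measurements).
operating_points = [
    ("cheap drafter, good acceptance", 3.5, 5, 0.05),
    ("cheap drafter, poor acceptance", 1.2, 5, 0.05),
    ("expensive drafter, good acceptance", 3.5, 5, 0.60),
]

for label, k_bar, gamma, rho in operating_points:
    s = estimated_speedup(k_bar, gamma, rho)
    print(f"{label}: S ≈ {s:.2f}, helps={breaks_even(k_bar, gamma, rho)}")
```

The formula only models draft overhead. It does not capture the extra verification cost that appears once the GPU is compute-bound at large batch sizes, which is why measurement still has the last word.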

---

vLLM

Offline (the `LLM` class)

In vLLM, speculative decoding is configured through a single speculative_config dict. The output is identical to non-speculative decoding for the lossless methods — only the latency changes.

EAGLE-3 draft (recommended general-purpose):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5,  # draft length γ
    },
)

# Greedy decoding, so the output can be compared against a non-speculative run.
params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(out[0].outputs[0].text)
```

Classic separate draft model (vanilla SD):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # the draft
        "num_speculative_tokens": 5,
    },
)
```

N-gram / prompt-lookup (free, great for repetitive output):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,  # longest n-gram to match
        "prompt_lookup_min": 2,  # shortest n-gram to match
    },
)
```

This method drafts by copying continuations of n-grams it has already seen in the prompt + generated text — no model, no memory cost. It is astonishingly effective for summarization, code editing, and RAG, where the output echoes the input.
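The copying idea is simple enough to sketch in a few lines. The function below is an illustrative toy, not vLLM's implementation: `propose_ngram_draft` is a hypothetical name, and its `max_n`, `min_n`, and `num_draft` arguments only loosely mirror `prompt_lookup_max`, `prompt_lookup_min`, and `num_speculative_tokens`. It takes the last n tokens of the context, looks for the most recent earlier occurrence of that n-gram, and proposes whatever followed it.

```python
def propose_ngram_draft(tokens, max_n=4, min_n=2, num_draft=5):
    """Toy prompt-lookup drafter over a list of token ids.

    Tries the longest suffix first; returns up to `num_draft` tokens that
    followed the most recent earlier occurrence, or [] if nothing matches.
    """
    for n in range(max_n, min_n - 1, -1):
        if len(tokens) <= n:
            continue  # context too short to contain an earlier copy
        pattern = tokens[-n:]
        # Scan candidate start positions from most recent to oldest,
        # skipping the suffix itself (it has no continuation yet).
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == pattern:
                return tokens[start + n:start + n + num_draft]
    return []


# The context ends in (2, 3), which appeared earlier followed by 4, 5, 9.
context = [1, 2, 3, 4, 5, 9, 2, 3]
print(propose_ngram_draft(context, num_draft=3))
```

The target model then verifies the proposed tokens exactly as it would verify a neural drafter's output, so a bad guess costs some wasted verification work but never changes the result.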

Online server

Everything above maps onto vllm serve with --speculative-config taking the same JSON:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 5}'
```

The server then exposes the standard OpenAI-compatible API — clients need no changes:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "def fibonacci(n):", "max_tokens": 128, "temperature": 0}'
```

---

SGLang

SGLang exposes speculative decoding through --speculative-* launch flags. Its EAGLE integration is mature and well-tuned; the three numbers to know map directly onto the tree concepts from earlier: num-steps is the draft depth, eagle-topk the branching factor, and num-draft-tokens the total tree nodes verified per step.

EAGLE-3 server:

```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 64 \
    --dtype float16
```

Checkpoint formats differ between engines. SGLang's EAGLE3 expects a SpecForge-format draft checkpoint (e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B or the lmsys/SGLang-EAGLE3-…-SpecForge series), *not* the yuhuili/EAGLE3-… checkpoints that vLLM and the reference EAGLE repo load. Using the wrong format is a common first-run failure — match the checkpoint to the engine.

EAGLE v1/v2 draft checkpoints use the same flags with --speculative-algorithm EAGLE. SGLang also exposes an OpenAI-compatible endpoint, on port 30000 by default:

```bash
curl http://localhost:30000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "default", "prompt": "Write a haiku about GPUs.", "max_tokens": 64, "temperature": 0}'
```

Tuning the tree. The three speculative knobs trade acceptance against verification cost — exactly the tension Parallel Drafting with Block Diffusion was about. Larger num-steps and eagle-topk raise $\bar k$ but inflate num-draft-tokens and thus verification cost. Sensible starting points are num-steps 3–6, eagle-topk 4–8, num-draft-tokens 32–64; sweep them on your traffic and keep the configuration that maximizes measured tokens/second, not acceptance rate.
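One way to organize that sweep is to enumerate the candidate launch commands up front and benchmark each in turn. The helper below is a hypothetical convenience, not part of SGLang. It only assembles command lines from the flags and checkpoint paths shown above, and the grid values are just the suggested starting ranges. Your installed SGLang version may reject some combinations, so treat the output as candidates to try.

```python
import itertools
import shlex

# Flags and paths copied from the EAGLE-3 server example above.
BASE_CMD = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
    "--speculative-algorithm", "EAGLE3",
    "--speculative-draft-model-path", "jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B",
]


def sweep_commands(num_steps=(3, 4, 5, 6), topk=(4, 8), draft_tokens=(32, 64)):
    """Yield one launch command (as an argument list) per grid point."""
    for steps, k, n in itertools.product(num_steps, topk, draft_tokens):
        yield BASE_CMD + [
            "--speculative-num-steps", str(steps),
            "--speculative-eagle-topk", str(k),
            "--speculative-num-draft-tokens", str(n),
        ]


# Print shell-ready commands; launch and benchmark each one separately.
for cmd in sweep_commands():
    print(shlex.join(cmd))
```

Record tokens/second for each configuration against the same prompt set, and keep the fastest one rather than the one with the highest acceptance rate.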

---

Measuring the speedup honestly

Speculative Decoding, Formally warned that acceptance rate is necessary but not sufficient — only the clock pays the bills. Always benchmark with and without speculation on *your* prompts, at *your* batch size. A minimal offline A/B:

```python
import time
from vllm import LLM, SamplingParams

prompts = [...]  # a representative sample of YOUR traffic
params = SamplingParams(temperature=0.0, max_tokens=256)

def bench(llm, label):
    """Time one generate() call over all prompts and report generated tokens/second."""
    t0 = time.perf_counter()
    outs = llm.generate(prompts, params)
    dt = time.perf_counter() - t0
    ntok = sum(len(o.outputs[0].token_ids) for o in outs)
    print(f"{label}: {ntok/dt:,.1f} tok/s ({dt:.2f}s for {ntok} tokens)")

# Both engines are created in one process here; if GPU memory is tight,
# run the baseline and speculative configurations in separate processes.
baseline = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
bench(baseline, "baseline")

spec = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5,
    },
)
bench(spec, "eagle3")
```

For server deployments, drive load with a realistic concurrency sweep (e.g. vLLM's benchmark_serving.py or SGLang's bench_serving) and watch per-request latency and tokens/second as a function of concurrency. You will typically see the speculative speedup *shrink* as concurrency rises. That is the expected behavior from The Need for Speed, and the reason many production stacks gate speculation on current load.

Checklist for an honest benchmark:

1. Use *your* prompt distribution: acceptance is workload-dependent (Speculative Decoding, Formally).
2. Match temperature to production: lower temperature usually accepts more.
3. Sweep batch size / concurrency: the speedup is load-dependent.
4. Report tokens/second and p50/p99 latency, not just acceptance rate.
5. Verify correctness: with temperature=0 and a lossless method, speculative and baseline outputs should match token-for-token.
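Items 4 and 5 of the checklist are easy to automate. The two helpers below are a minimal sketch with hypothetical names: one finds the first position where a baseline and a speculative token sequence disagree, the other summarizes a list of per-request latencies as p50/p99. They operate on plain Python lists, so how you collect the token ids and latencies is up to your harness.

```python
import statistics


def first_divergence(baseline_ids, spec_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline_ids, spec_ids)):
        if a != b:
            return i
    if len(baseline_ids) != len(spec_ids):
        # One sequence is a strict prefix of the other.
        return min(len(baseline_ids), len(spec_ids))
    return None


def latency_summary(latencies_s):
    """Summarize per-request latencies (seconds) as p50 and p99."""
    # quantiles(n=100) returns the 99 percentile cut points p1..p99.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


# Toy data standing in for real measurements.
print(first_divergence([11, 42, 7, 99], [11, 42, 8, 99]))
print(latency_summary([0.21, 0.25, 0.22, 0.90, 0.24, 0.23]))
```

With the offline benchmark above, the token sequences would come from `o.outputs[0].token_ids` for each request in the baseline and speculative runs. A divergence found this way is a prompt to investigate, not proof of a bug on its own.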

---

Conclusion: a free lunch, served

Speculative decoding is the rare optimization that asks you to give up nothing. The output distribution is provably unchanged (Speculative Decoding, Formally); the only currency you spend is some wasted draft compute, drawn from a pool — the GPU's idle decode cycles (The Need for Speed) — that was going to waste anyway.

The throughline of this series is a single asymmetry: generation is sequential, verification is parallel, and modern hardware is starved for parallel work during decoding. Everything else — Medusa's heads, EAGLE's feature predictor, Sequoia's optimal trees, DFlash's block-diffusion drafting — is a more and more sophisticated way to manufacture good guesses cheaply and check them in bulk. The field's trajectory has been steadily toward *simpler deployment* (training-free, plug-and-play) and *broader scope* (reasoning, SSMs, retrieval, vision), and its center of gravity has moved from research code into the inference engines you can pip install today.

So the practical advice is short. Pick a method that matches your workload, turn it on with a flag, and measure the clock. For most latency-sensitive deployments, a few lines of configuration buy a 2–4× speedup for free. That is about as close to a free lunch as systems engineering ever gets.

---

References

1. vLLM. *Speculative Decoding* documentation. https://docs.vllm.ai/ (see the speculative_config / --speculative-config reference for your version).
2. SGLang. *Speculative Decoding (EAGLE)* documentation. https://docs.sglang.ai/
3. Kwon, W., et al. (2023). *Efficient Memory Management for Large Language Model Serving with PagedAttention.* ACM SOSP. (vLLM)
4. Li, Y., Wei, F., Zhang, C., Zhang, H., et al. (2025). *EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test.* arXiv:2503.01840.
5. Leviathan, Y., Kalman, M., & Matias, Y. (2023). *Fast Inference from Transformers via Speculative Decoding.* ICML. arXiv:2211.17192.
6. Saxena, A. (2023). *Prompt Lookup Decoding.* (n-gram / prompt-lookup drafting). https://github.com/apoorvumang/prompt-lookup-decoding
7. NVIDIA. *TensorRT-LLM*, speculative decoding support. https://github.com/NVIDIA/TensorRT-LLM