Putting It to Work: Serving Speculative Decoding with vLLM and SGLang
*Part 6 of 6 — Speculative Decoding series. Start with The Need for Speed.*
1. The Need for Speed: Why LLMs are memory-bandwidth-bound and how speculation exploits the GPU's idle compute.
2. Speculative Decoding, Formally: The draft-then-verify algorithm, the rejection-sampling proof, and the metrics that matter.
3. A Field Guide to Speculative Decoding Methods: Separate models, Medusa, n-gram, and trees: a taxonomy of every drafting approach.
4. The EAGLE Family: How predicting hidden states instead of tokens reshaped the field, across three generations.
5. Parallel Drafting with Block Diffusion: DFlash collapses γ draft passes into one; DDTree turns that pass into a 7× verification tree.
6. Putting It to Work (this post): Enabling EAGLE, n-gram, and Medusa speculation in vLLM and SGLang, with a guide to measuring real speedup.
---
*Speculative decoding makes an LLM generate faster while provably leaving its output unchanged — a cheap drafter proposes several tokens and the target verifies them in one parallel pass (lossless, via rejection sampling), cashing in the fact that decoding is memory-bandwidth-bound and the GPU is otherwise idle. The drafters have grown steadily more capable — separate small models, Medusa heads, feature-level EAGLE predictors, and parallel block-diffusion drafters — all to raise the accepted-tokens-per-step that drives the speedup.*
This final post is operational. The good news is that you almost never implement speculative decoding by hand anymore — the two dominant open-source inference engines, vLLM and SGLang, ship it as a configuration flag. Below is how to turn it on, how to choose a method, and how to confirm the speedup is real on *your* workload.
Version caveat. Speculative-decoding APIs in both engines have moved fast. The snippets below reflect the stable interfaces as of early 2026 (vLLM's `speculative_config` on the V1 engine; SGLang's `--speculative-*` flags). Always cross-check against the docs for your installed version; argument names and supported methods change between releases.
---
A decision guide: which method?
Before any code, pick a method to match your situation. The trade-offs are exactly the ones captured by the speedup formula from Speculative Decoding, Formally, $S \approx \bar k / (1 + \gamma\rho)$, and by the taxonomy in A Field Guide to Speculative Decoding Methods:
| If you… | Use | Why |
|---------|-----|-----|
| Have a small model in the same family | Draft model (vanilla SD) | Zero training, lossless, simple |
| Want the best general speedup, can grab a checkpoint | EAGLE-3 | Top acceptance, plug-and-play, only a lightweight draft head to host (no separate full model) |
| Generate repetitive/structured output (code, SQL, JSON, agents) | N-gram / prompt-lookup (or SuffixDecoding-style) | Free, no model, huge wins on repetition |
| Are latency-bound at low batch size / low load | Any of the above | SD shines when the GPU is idle |
| Are throughput-bound at high batch size | *Reconsider* | At large batches the GPU is already compute-bound; SD's extra verify work can *hurt* |
That last row is the single most important operational caveat, and it follows directly from The Need for Speed: speculation is paid for out of the GPU's *idle* capacity. Under heavy batched load that idle capacity is gone, so the cost of drafting and verifying tokens that end up rejected can dominate. Speculation is primarily a latency optimization for low-to-moderate load, and most engines let you disable it dynamically as load rises.
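To make the load dependence concrete, the speedup formula quoted above can be evaluated numerically. This is a rough sketch, assuming $\bar k$ is the mean number of tokens produced per verification step, $\gamma$ the draft length, and $\rho$ the cost of one draft pass relative to one target pass. The `verify_cost` parameter is a hypothetical extension that is not part of the original formula; it stands in for how much more a verification pass costs than a plain decode step once the GPU is compute-bound. All numbers are illustrative, not measurements.

```python
def estimated_speedup(k_bar: float, gamma: int, rho: float,
                      verify_cost: float = 1.0) -> float:
    """S ~ k_bar / (verify_cost + gamma * rho).

    With verify_cost=1.0 this is the formula from the text. verify_cost > 1
    is a toy assumption: the verify pass is no longer "free" under load.
    """
    return k_bar / (verify_cost + gamma * rho)


# Idle GPU: verifying gamma+1 tokens costs about one ordinary decode step.
idle = estimated_speedup(k_bar=3.5, gamma=5, rho=0.05)

# Loaded GPU (assumed): the same verify pass costs 4x an ordinary decode step.
loaded = estimated_speedup(k_bar=3.5, gamma=5, rho=0.05, verify_cost=4.0)

print(f"idle GPU:   {idle:.2f}x")
print(f"loaded GPU: {loaded:.2f}x")
```

Under these assumed numbers the same acceptance behavior that yields a clear win on an idle GPU turns into a net slowdown once verification stops being cheap, which is the pattern the last table row warns about.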
---
vLLM
Offline (the LLM class)
In vLLM, speculative decoding is configured through a single speculative_config dict. The output is identical to non-speculative decoding for the lossless methods — only the latency changes.
EAGLE-3 draft (recommended general-purpose):
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5,  # draft length γ
    },
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(out[0].outputs[0].text)
```
Classic separate draft model (vanilla SD):
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # the draft
        "num_speculative_tokens": 5,
    },
)
```
N-gram / prompt-lookup (free, great for repetitive output):
```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,  # longest n-gram to match
        "prompt_lookup_min": 2,  # shortest n-gram to match
    },
)
```
This method drafts by copying continuations of n-grams it has already seen in the prompt + generated text — no model, no memory cost. It is astonishingly effective for summarization, code editing, and RAG, where the output echoes the input.
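The idea behind prompt-lookup drafting fits in a few lines. The sketch below is an illustration of the technique, not vLLM's implementation: the function name `ngram_draft` and its parameters are hypothetical, and the choice to prefer the longest n-gram and the most recent earlier match is an assumption made for the example.

```python
from typing import List


def ngram_draft(tokens: List[int], num_speculative_tokens: int = 5,
                lookup_max: int = 4, lookup_min: int = 2) -> List[int]:
    """Propose draft tokens by copying what followed an earlier occurrence
    of the current suffix. Returns [] when no allowed n-gram matches."""
    for n in range(lookup_max, lookup_min - 1, -1):  # prefer the longest match
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # Scan earlier start positions, most recent first, excluding the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # Copy the tokens that followed the earlier occurrence.
                return tokens[start + n:start + n + num_speculative_tokens]
    return []


# Toy token ids: the sequence ends in (2, 3), which also appeared earlier
# followed by 4, 5, 9. Those become the draft.
history = [1, 2, 3, 4, 5, 9, 2, 3]
print(ngram_draft(history, num_speculative_tokens=3))
```

The target model still verifies every proposed token, so a bad match costs only wasted verification work; it cannot change the output.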
Online server
Everything above maps onto vllm serve with --speculative-config taking the same JSON:
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 5}'
```
The server then exposes the standard OpenAI-compatible API — clients need no changes:
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "def fibonacci(n):", "max_tokens": 128, "temperature": 0}'
```
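The same request can be issued from Python with only the standard library. This is a minimal sketch: the helper names `build_request` and `complete` are hypothetical, the URL and payload mirror the curl example, and the response is assumed to follow the usual OpenAI completions shape (`choices[0].text`).

```python
import json
import urllib.request


def build_request(prompt: str, base_url: str = "http://localhost:8000",
                  model: str = "meta-llama/Llama-3.1-8B-Instruct",
                  max_tokens: int = 128) -> urllib.request.Request:
    """Build the same POST /v1/completions request as the curl example."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": 0}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server from the previous step running, `complete("def fibonacci(n):")` returns the completion text. Nothing in the client refers to speculation, which is the point: it is purely a server-side setting.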
---
SGLang
SGLang exposes speculative decoding through --speculative-* launch flags. Its EAGLE integration is mature and well-tuned; the three numbers to know map directly onto the tree concepts from earlier: num-steps is the draft depth, eagle-topk the branching factor, and num-draft-tokens the total tree nodes verified per step.
EAGLE-3 server:
```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 64 \
  --dtype float16
```
Checkpoint formats differ between engines. SGLang's EAGLE3 expects a SpecForge-format draft checkpoint (e.g. `jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B` or the `lmsys/SGLang-EAGLE3-…-SpecForge` series), *not* the `yuhuili/EAGLE3-…` checkpoints that vLLM and the reference EAGLE repo load. Using the wrong format is a common first-run failure, so match the checkpoint to the engine.
EAGLE v1/v2 checkpoints use the same flags with `--speculative-algorithm EAGLE` and the matching draft checkpoint. SGLang also exposes an OpenAI-compatible endpoint, on port 30000 by default:
```bash
curl http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "prompt": "Write a haiku about GPUs.", "max_tokens": 64, "temperature": 0}'
```
Tuning the tree. The three speculative knobs trade acceptance against verification cost — exactly the tension Parallel Drafting with Block Diffusion was about. Larger num-steps and eagle-topk raise $\bar k$ but inflate num-draft-tokens and thus verification cost. Sensible starting points are num-steps 3–6, eagle-topk 4–8, num-draft-tokens 32–64; sweep them on your traffic and keep the configuration that maximizes measured tokens/second, not acceptance rate.
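A sweep over these knobs can be organized as a small grid search. The sketch below is one possible shape for it, not an SGLang utility: `sweep`, `launch_flags`, and the `measure_tok_per_s` callback are hypothetical names, and the callback is where you would launch the server with the rendered flags, replay your traffic, and return measured tokens/second. The `toy_measure` function only stands in for that step so the example runs; its numbers are made up.

```python
import itertools
from typing import Callable, Dict, Iterable, List, Tuple

Config = Tuple[int, int, int]  # (num_steps, eagle_topk, num_draft_tokens)


def launch_flags(cfg: Config) -> List[str]:
    """Render one configuration as the SGLang flags used in the text."""
    steps, topk, draft = cfg
    return ["--speculative-num-steps", str(steps),
            "--speculative-eagle-topk", str(topk),
            "--speculative-num-draft-tokens", str(draft)]


def sweep(measure_tok_per_s: Callable[[Config], float],
          steps: Iterable[int] = (3, 4, 5, 6),
          topks: Iterable[int] = (4, 8),
          drafts: Iterable[int] = (32, 64)) -> Tuple[Config, Dict[Config, float]]:
    """Measure every grid point; return the config with the best throughput."""
    results = {cfg: measure_tok_per_s(cfg)
               for cfg in itertools.product(steps, topks, drafts)}
    best = max(results, key=results.get)
    return best, results


def toy_measure(cfg: Config) -> float:
    """Stand-in for a real benchmark run. The numbers are invented."""
    steps, topk, draft = cfg
    return 100.0 + 10 * steps - 0.5 * draft - 2 * abs(topk - 4)


best, results = sweep(toy_measure)
print(best, " ".join(launch_flags(best)))
```

Selecting on the measured throughput rather than on acceptance keeps the sweep honest: a deeper, wider tree can raise acceptance while still losing on the clock.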
---
Measuring the speedup honestly
Speculative Decoding, Formally warned that acceptance rate is necessary but not sufficient — only the clock pays the bills. Always benchmark with and without speculation on *your* prompts, at *your* batch size. A minimal offline A/B:
```python
import time
from vllm import LLM, SamplingParams

prompts = [...]  # a representative sample of YOUR traffic
params = SamplingParams(temperature=0.0, max_tokens=256)


def bench(llm, label):
    """Time one generate() call over all prompts and report tokens/second."""
    t0 = time.perf_counter()
    outs = llm.generate(prompts, params)
    dt = time.perf_counter() - t0
    ntok = sum(len(o.outputs[0].token_ids) for o in outs)
    print(f"{label}: {ntok/dt:,.1f} tok/s  ({dt:.2f}s for {ntok} tokens)")


# Two engines in one process may not fit on a single GPU;
# if so, run each configuration in its own process.
baseline = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
bench(baseline, "baseline")

spec = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={"method": "eagle3",
                        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
                        "num_speculative_tokens": 5},
)
bench(spec, "eagle3")
```
For server deployments, drive load with a realistic concurrency sweep (e.g. vLLM's `benchmark_serving.py` or SGLang's `bench_serving`) and watch per-request latency and tokens/second as a function of concurrency. You will typically see the speculative speedup *shrink* as concurrency rises. That is the expected behavior from The Need for Speed, and the reason many production stacks gate speculation on current load.
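If you would rather script the sweep yourself, its core is small. The following is a sketch under stated assumptions, not a replacement for the engines' own benchmark tools: `run_level` and the `send` callback are hypothetical names, `send` is expected to issue one request and return the number of generated tokens, and `fake_send` merely simulates a server so the example runs without one.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable, Dict, Sequence


def run_level(send: Callable[[str], int], prompts: Sequence[str],
              concurrency: int) -> Dict[str, float]:
    """Issue all prompts with `concurrency` worker threads and report
    throughput plus p50/p99 per-request latency (needs >= 2 prompts)."""
    def timed(prompt: str):
        t0 = time.perf_counter()
        ntok = send(prompt)  # must return the number of generated tokens
        return time.perf_counter() - t0, ntok

    t_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = list(pool.map(timed, prompts))
    wall = time.perf_counter() - t_start

    latencies = [lat for lat, _ in samples]
    cuts = quantiles(latencies, n=100, method="inclusive")  # 99 cut points
    return {"concurrency": concurrency,
            "tok_per_s": sum(n for _, n in samples) / wall,
            "p50_s": cuts[49], "p99_s": cuts[98]}


def fake_send(prompt: str) -> int:
    """Stand-in for an HTTP call: sleeps briefly and reports 8 tokens."""
    time.sleep(0.01)
    return 8


for level in (1, 4):
    print(run_level(fake_send, ["p"] * 8, level))
```

Run the same sweep against a baseline server and a speculative one, then compare the two curves level by level; a single concurrency point says little about where the speedup disappears.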
Checklist for an honest benchmark:
1. Use *your* prompt distribution — acceptance is workload-dependent (Speculative Decoding, Formally).
2. Match temperature to production: lower temperatures usually yield higher acceptance.
3. Sweep batch size / concurrency — the speedup is load-dependent.
4. Report tokens/second and p50/p99 latency, not just acceptance rate.
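5. Verify correctness: with temperature=0 and a lossless method, speculative and baseline outputs should match token-for-token. Investigate any divergence, bearing in mind that floating-point differences between the two execution paths can occasionally flip a near-tie.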
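The token-for-token check in the last item is easy to automate. The helpers below are a minimal sketch with hypothetical names (`first_divergence`, `mismatched_prompts`); they assume you have collected the generated token ids per prompt from both runs, for example from `o.outputs[0].token_ids` in the vLLM benchmark above.

```python
from typing import List, Optional, Sequence, Tuple


def first_divergence(baseline: Sequence[int],
                     speculative: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip(baseline, speculative)):
        if a != b:
            return i
    if len(baseline) != len(speculative):
        # One output is a strict prefix of the other.
        return min(len(baseline), len(speculative))
    return None


def mismatched_prompts(baseline_runs: Sequence[Sequence[int]],
                       speculative_runs: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """(prompt_index, token_index) for every prompt whose outputs differ."""
    mismatches = []
    for i, (base, spec) in enumerate(zip(baseline_runs, speculative_runs)):
        where = first_divergence(base, spec)
        if where is not None:
            mismatches.append((i, where))
    return mismatches


# Toy token ids: the second prompt diverges at position 1.
print(mismatched_prompts([[1, 2], [3, 4]], [[1, 2], [3, 5]]))
```

Reporting the position of the first mismatch, rather than a bare pass/fail, makes it easier to tell a systematic configuration error (divergence at position 0 on every prompt) from an isolated late divergence.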
---
Conclusion: a free lunch, served
Speculative decoding is the rare optimization that asks you to give up nothing. The output distribution is provably unchanged (Speculative Decoding, Formally); the only currency you spend is some wasted draft compute, drawn from a pool — the GPU's idle decode cycles (The Need for Speed) — that was going to waste anyway.
The throughline of this series is a single asymmetry: generation is sequential, verification is parallel, and modern hardware is starved for parallel work during decoding. Everything else — Medusa's heads, EAGLE's feature predictor, Sequoia's optimal trees, DFlash's block-diffusion drafting — is a more and more sophisticated way to manufacture good guesses cheaply and check them in bulk. The field's trajectory has been steadily toward *simpler deployment* (training-free, plug-and-play) and *broader scope* (reasoning, SSMs, retrieval, vision), and its center of gravity has moved from research code into the inference engines you can pip install today.
So the practical advice is short. Pick a method that matches your workload, turn it on with a flag, and measure the clock. For most latency-sensitive deployments, a few lines of configuration buy a 2–4× speedup for free. That is about as close to a free lunch as systems engineering ever gets.
---
References
1. vLLM — *Speculative Decoding* documentation. https://docs.vllm.ai/ (see the speculative_config / --speculative-config reference for your version).
2. SGLang — *Speculative Decoding (EAGLE)* documentation. https://docs.sglang.ai/
3. Kwon, W., et al. (2023). *Efficient Memory Management for Large Language Model Serving with PagedAttention.* ACM SOSP. (vLLM)
4. Li, Y., Wei, F., Zhang, C., Zhang, H., et al. (2025). *EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test.* arXiv:2503.01840.
5. Leviathan, Y., Kalman, M., & Matias, Y. (2023). *Fast Inference from Transformers via Speculative Decoding.* ICML. arXiv:2211.17192.
6. Saxena, A. (2023). *Prompt Lookup Decoding.* (n-gram / prompt-lookup drafting). https://github.com/apoorvumang/prompt-lookup-decoding
7. NVIDIA. *TensorRT-LLM* — speculative decoding support. https://github.com/NVIDIA/TensorRT-LLM