Diffusion vs Autoregression: Why Language Models May Not Need to Think Left to Right

June 25, 2026

*Part 1 of the Diffusion Language Models series.*

---

For most of the LLM era, one design choice has been treated as settled: language models write one token after another. That has held for GPT-style models, Llama-style models, and (until a couple of days ago!) Gemma-style models. The default has been the same. Given a prefix, predict the next token. Append it. Repeat.

Diffusion language models (DLMs) generate text differently. They do not begin with an empty page and extend it from left to right. They begin with a corrupted, masked, or incomplete sequence and repeatedly repair it. The output does not arrive as a chain of irreversible commitments. It arrives as a whole object being refined.

LLaDA showed that a pure masked-diffusion model trained from scratch can compete with autoregressive models of similar scale. Inception Labs' Mercury made the speed case commercially (up to 1,000 tokens/s), Google previewed Gemini Diffusion, and now there is DiffusionGemma (a 26B MoE block-diffusion model). No single model proves diffusion will replace autoregression. It will not, at least not everywhere. But the steady accumulation of credible models is a welcome break from the AR-only norm.

So let us compare the two ways of generating text.

---

How Autoregression Generates

An autoregressive (AR) language model factorizes text from left to right:

$$
p(x) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \dots, x_{n-1}) = \prod_{i=1}^{n} p(x_i \mid x_{<i})
$$

At training time, this is simple. Feed the model the true prefix and train it to predict the next token.
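As a minimal sketch of what "feed the true prefix, predict the next token" means in terms of data, the snippet below turns one sequence into (prefix, target) pairs. The function name `next_token_examples` and the `<bos>` start token are illustrative assumptions, not part of any particular library.

```python
from typing import List, Tuple


def next_token_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Build one (true prefix, next token) training pair per position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# "<bos>" is a hypothetical start-of-sequence marker.
sequence = ["<bos>", "The", "patient", "reports"]
examples = next_token_examples(sequence)

for prefix, target in examples:
    print(prefix, "->", target)
```

In practice all of these pairs are trained in a single forward pass, because causal attention lets every position predict its own next token at once.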

At inference time, the simplicity becomes a constraint:

```text
step 1: predict token 1
step 2: predict token 2, conditioned on token 1
step 3: predict token 3, conditioned on tokens 1 and 2
...
step N: predict token N, conditioned on everything before it
```

One forward pass gives you one new token.
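The loop above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_BIGRAM` is a made-up lookup table that only looks at the last token, whereas a real language model conditions on the whole prefix. The point is only the shape of the loop: one call to the model per emitted token.

```python
from typing import Dict, List, Tuple

# Hypothetical toy "model": previous token -> next-token probabilities.
TOY_BIGRAM: Dict[str, Dict[str, float]] = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"patient": 0.8, "cough": 0.2},
    "a": {"cough": 1.0},
    "patient": {"reports": 0.7, "<eos>": 0.3},
    "reports": {"cough": 0.6, "fever": 0.4},
    "cough": {"<eos>": 1.0},
    "fever": {"<eos>": 1.0},
}


def next_token_distribution(prefix: List[str]) -> Dict[str, float]:
    """One 'forward pass': return p(next token | prefix)."""
    return TOY_BIGRAM[prefix[-1]]


def greedy_decode(max_new_tokens: int = 10) -> Tuple[List[str], int]:
    prefix = ["<bos>"]
    forward_passes = 0
    for _ in range(max_new_tokens):
        dist = next_token_distribution(prefix)
        forward_passes += 1
        token = max(dist, key=dist.get)  # greedy: take the most likely token
        prefix.append(token)             # commit; earlier tokens are never revisited
        if token == "<eos>":
            break
    return prefix[1:], forward_passes


tokens, passes = greedy_decode()
print(tokens, passes)
```

The number of forward passes always equals the number of generated tokens, which is the sequential cost the rest of this post is about.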

That is the core bargain of autoregression. It streams well. It gives exact next-token likelihoods. It fits naturally with chat interfaces. It also locks generation into a causal order.

Token 4 cannot know what token 40 will be. Token 40 can depend on token 4, but token 4 cannot be revised unless you regenerate the suffix or add an external editing loop. The model is always writing forward, even when the task itself is not forward-shaped.

This matters more than it first appears, because many of the outputs we ask for are not forward-shaped. Take a JSON tool call: a "summary" field near the top may need to be consistent with "line_items" that appear hundreds of tokens later. The object's parts constrain each other globally, but AR decoding must commit to the summary before it has written the items. The same shape shows up in code (a function signature constrained by its body), clinical notes (an assessment constrained by the plan), and forms with mutually dependent fields.

Autoregressive models handle this with scale, instruction tuning, tool loops, rejection sampling, and retries.

---

How Diffusion Generates

Diffusion language models ask a different question.

Autoregression asks:

```text
What token comes next?
```

Diffusion asks:

```text
What would make this corrupted sequence less corrupted?
```

In image diffusion (Ho et al., 2020; Song et al., 2021), the model starts from noise and gradually denoises toward an image. In language diffusion, the continuous Gaussian noise is usually replaced with discrete corruption (Austin et al., 2021): tokens are masked, replaced, or otherwise damaged, and the model learns to reconstruct the original sequence.

A masked diffusion language model might begin like this:

```text
Prompt: Summarize this visit as a SOAP note.

Output: [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
```

Then, instead of predicting only the first missing token, the model predicts many positions at once:

```text
pass 1: [MASK] patient [MASK] cough [MASK] fever [MASK]
pass 2: The patient reports cough and fever for [MASK] days.
pass 3: The patient reports cough and fever for three days.
```

The important part is not the [MASK] token itself. The important part is that the model can condition on both sides of a position while generating. The model does not have to pretend the future is invisible.

An AR model must choose the next token now. A diffusion model can leave uncertain regions unresolved, fill the easy parts first, and come back with more context. The sequence becomes a working draft rather than a one-way transcript.

---

The Architecture Difference

A diffusion LM can still use embeddings, attention layers, MLPs, residual streams, normalization, and output logits. The architectural break is mostly in two places:

1. The attention mask
2. The training objective

Autoregressive models use causal attention:

```text
token 4 can see:    token 1, token 2, token 3, token 4
token 4 cannot see: token 5, token 6, token 7...
```

Diffusion language models typically use bidirectional attention over the visible/corrupted sequence:

```text
token 4 can see: token 1, token 2, token 3, token 4, token 5, token 6...
```

That is the heart of the difference. AR models model continuation. DLMs model reconstruction.
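The two visibility patterns can be written down as boolean masks. This is a plain-Python sketch with illustrative helper names (`causal_mask`, `bidirectional_mask`, `visible_positions`); real implementations build the same pattern as a tensor and apply it inside attention.

```python
from typing import List

Mask = List[List[bool]]  # mask[query][key] is True if query may attend to key


def causal_mask(n: int) -> Mask:
    """Each position sees itself and everything to its left."""
    return [[key <= query for key in range(n)] for query in range(n)]


def bidirectional_mask(n: int) -> Mask:
    """Every position sees every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask: Mask, query: int) -> List[int]:
    """1-indexed positions that the 1-indexed `query` token can attend to."""
    return [key + 1 for key, allowed in enumerate(mask[query - 1]) if allowed]


print(visible_positions(causal_mask(7), 4))         # [1, 2, 3, 4]
print(visible_positions(bidirectional_mask(7), 4))  # [1, 2, 3, 4, 5, 6, 7]
```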

The training objective changes accordingly. Instead of training only on "predict the next token," the model is trained to recover clean tokens from corrupted sequences across different noise levels (Sahoo et al., 2024).

In a simplified masked setup:

```text
clean:     The patient reports chest pain with exertion.
corrupted: The [MASK] reports [MASK] pain with [MASK].
target:    patient, chest, exertion
```

The model learns: given the damaged object, infer the missing parts.
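A rough sketch of how one such training example could be produced: sample a noise level, mask each token independently with that probability, and keep the original tokens at masked positions as targets. The `corrupt` helper, the whitespace tokenization, and the uniform sampling of the mask ratio are simplifying assumptions; published objectives typically also weight the loss by noise level, which is omitted here.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"


def corrupt(
    tokens: List[str], mask_ratio: float, rng: random.Random
) -> Tuple[List[str], Dict[int, str]]:
    """Mask each token with probability `mask_ratio`.

    Returns the corrupted sequence and a {position: clean token} dict
    covering only the masked positions, where the loss would be computed.
    """
    corrupted: List[str] = []
    targets: Dict[int, str] = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets


rng = random.Random(0)
clean = "The patient reports chest pain with exertion .".split()
mask_ratio = rng.random()  # assumption: one noise level sampled per example
corrupted, targets = corrupt(clean, mask_ratio, rng)

print("corrupted:", " ".join(corrupted))
print("targets:  ", targets)
```

Because the noise level varies per example, the same model sees everything from lightly damaged sentences to almost fully masked ones.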

At inference time, generation becomes iterative:

```text
1. Keep the prompt visible.
2. Initialize the output region as masked.
3. Predict all masked positions.
4. Accept the high-confidence positions.
5. Keep uncertain positions masked.
6. Repeat until the sequence stabilizes.
```

LLaDA is a clean example of this style, applied over the full sequence at once.
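The six-step loop can be sketched with a toy denoiser. Nothing below is LLaDA's actual sampler: `REFERENCE`, `BASE_CONFIDENCE`, the neighbour bonus, and the threshold are invented so that the example is deterministic, and the toy ignores the prompt that a real model would attend to. Confidences are integer percentages to keep comparisons exact.

```python
from typing import List, Tuple

MASK = "[MASK]"
# Hypothetical sentence the toy denoiser steers toward, with made-up
# per-token base confidences (content words are "easy", "three" is "hard").
REFERENCE = "The patient reports cough and fever for three days .".split()
BASE_CONFIDENCE = [60, 90, 50, 90, 50, 90, 60, 40, 60, 80]


def toy_denoiser(prompt: List[str], output: List[str]) -> List[Tuple[str, int]]:
    """Stand-in for a bidirectional model: (token, confidence) per position.

    Confidence rises by 20 for each already-visible neighbour on either side.
    """
    preds = []
    for i in range(len(output)):
        neighbours = [output[j] for j in (i - 1, i + 1) if 0 <= j < len(output)]
        visible = sum(tok != MASK for tok in neighbours)
        preds.append((REFERENCE[i], min(100, BASE_CONFIDENCE[i] + 20 * visible)))
    return preds


def diffusion_decode(prompt: List[str], length: int,
                     threshold: int = 80, max_passes: int = 16):
    output = [MASK] * length                      # step 2: start fully masked
    history = []
    for _ in range(max_passes):
        masked = [i for i, tok in enumerate(output) if tok == MASK]
        if not masked:                            # step 6: nothing left to fill
            break
        preds = toy_denoiser(prompt, output)      # step 3: predict all positions
        accepted = [i for i in masked if preds[i][1] >= threshold]  # step 4
        if not accepted:                          # guarantee progress each pass
            accepted = [max(masked, key=lambda i: preds[i][1])]
        for i in accepted:
            output[i] = preds[i][0]               # step 5: the rest stay masked
        history.append(list(output))
    return output, history


final, history = diffusion_decode(["Summarize", "this", "visit", "."], len(REFERENCE))
for n, state in enumerate(history, start=1):
    print(f"pass {n}:", " ".join(state))
```

With these made-up numbers the ten tokens settle in three passes: the easy content words first, their neighbours second, and the hard token ("three") last, once both of its neighbours are visible.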

Block diffusion (Arriola et al., 2025) is a hybrid: autoregressive *across* blocks, diffusion *within* them. The sequence is divided into fixed-size chunks generated left to right, each conditioned on the completed blocks before it — which means the prefix can be KV-cached, like in AR decoding — while the tokens inside the current block are denoised in parallel. This recovers some streaming behavior and inference efficiency without giving up parallel generation. It is the recipe behind LLaDA2.0's confidence-aware parallel decoding and, notably, DiffusionGemma — a sign that this family of ideas is moving from research curiosity to production architecture.
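A structural sketch of that hybrid, under heavy simplification: the outer loop runs left to right over blocks, the inner loop denoises one block, and finished blocks are frozen into the prefix. `toy_denoise_block` is a hypothetical stand-in that fills half of the remaining masks per pass with placeholder tokens; it is not how any of the models named above choose tokens.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"
DenoiseFn = Callable[[List[str], List[str]], List[str]]


def toy_denoise_block(prefix: List[str], block: List[str]) -> List[str]:
    """Hypothetical denoiser: fill half of the remaining masks (at least one)."""
    masked = [i for i, tok in enumerate(block) if tok == MASK]
    out = list(block)
    for i in masked[: max(1, len(masked) // 2)]:
        out[i] = f"tok{len(prefix) + i}"  # placeholder named by absolute position
    return out


def block_diffusion_decode(prompt: List[str], num_blocks: int, block_size: int,
                           denoise_block: DenoiseFn,
                           max_passes_per_block: int = 32) -> Tuple[List[str], int]:
    completed = list(prompt)   # frozen prefix: the part a KV cache could cover
    passes = 0
    for _block in range(num_blocks):            # autoregressive across blocks
        block = [MASK] * block_size
        for _pass in range(max_passes_per_block):
            if MASK not in block:
                break
            block = denoise_block(completed, block)  # parallel within the block
            passes += 1
        completed.extend(block)                 # block is final; it joins the prefix
    return completed[len(prompt):], passes


output, passes = block_diffusion_decode(["<bos>", "Hi"], num_blocks=2,
                                        block_size=4, denoise_block=toy_denoise_block)
print(output, passes)
```

Here eight tokens take six denoising passes instead of eight AR steps, and each pass only has to recompute the current block because the prefix never changes.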

---

Why DLMs Are Stronger Than They Look

1. They Generate Objects, Not Just Streams

As the JSON example earlier showed, many important outputs — tool calls, SQL queries, clinical notes, code files, forms, plans with dependencies — are objects with internal constraints, where the end can constrain the beginning.

AR can learn these dependencies, but decoding still exposes them one token at a time. DLMs can refine the whole object. That is a better fit.

2. Bidirectional Context Is Native

Infilling is awkward for pure AR models. You can add fill-in-the-middle training, special tokens, or editing tools, but the base decoding direction is still causal.

For DLMs, infilling is not a special case. It is the core task.

```text
The patient was started on [MASK] after the culture grew E. coli.
```

The model can use both the left context and the right context immediately. It does not need to generate the left half, then hope the right half still works. Reconstruction is the job.

This matters for editing, code repair, report completion, and any workflow where users want to change the middle without regenerating everything after it.

3. Parallelism Attacks the Sequential Bottleneck

Autoregressive decoding has a hard dependency chain. Even with a KV cache, token 500 cannot be generated before token 499 exists.

DLMs can generate multiple positions per forward pass. If a 512-token output converges in 64 denoising passes, the model has reduced the sequential depth by $8\times$. Wall-clock speedup will not be $8\times$, because each denoising pass is heavier than a cached AR step. But the direction is important: DLMs attack the bottleneck AR cannot remove.
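To make the gap between sequential depth and wall-clock time concrete, here is a back-of-the-envelope version. The symbols $N$ (output tokens), $T$ (denoising passes), $c_{\text{AR}}$ and $c_{\text{DLM}}$ (cost of one cached AR step and one denoising pass), and $r$ are notation introduced here for illustration, and the value $r = 4$ below is a made-up number, not a measurement:

$$
\text{depth reduction} = \frac{N}{T} = \frac{512}{64} = 8,
\qquad
\text{speedup} \approx \frac{N \, c_{\text{AR}}}{T \, c_{\text{DLM}}} = \frac{8}{r},
\quad r = \frac{c_{\text{DLM}}}{c_{\text{AR}}}
$$

If a denoising pass cost, say, four times as much as a cached AR step ($r = 4$), the $8\times$ reduction in depth would shrink to roughly a $2\times$ wall-clock gain. Diffusion only comes out ahead while $r < N/T$.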

This is why the speed story should not be dismissed too quickly. GPUs like parallel work. AR decoding is full of small sequential steps. Diffusion decoding offers a path to fewer, larger, more parallel steps.

4. Revision Is Built Into the Mental Model

AR models can revise, but revision is external to the decoding process. You ask the model to rewrite, or you regenerate a suffix, or you run another tool loop.

DLMs make revision feel native:

```text
generate draft
validate
remask bad spans
denoise again
validate again
```

That loop is especially natural for structured output. If a JSON field violates a schema, remask the field. If a code line fails a test, remask the region. If a note section contradicts the assessment, remask the assessment and plan.

The model is not just producing text. It is repairing an object.
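A minimal sketch of the validate-and-remask loop at the field level. The schema rule, the field names, and `toy_denoise` are all hypothetical; `None` plays the role of `[MASK]`, and the "denoiser" simply fills a masked total from the visible line items, standing in for a model that conditions on the rest of the object.

```python
from typing import Dict, List, Optional

Draft = Dict[str, Optional[int]]  # None plays the role of [MASK]


def validate(draft: Draft) -> List[str]:
    """Hypothetical schema rule: 'total' must equal the sum of the line items."""
    if draft["total"] != draft["item_a"] + draft["item_b"]:
        return ["total"]
    return []


def toy_denoise(draft: Draft) -> Draft:
    """Stand-in for the model: fill a masked 'total' from the visible fields."""
    filled = dict(draft)
    if filled["total"] is None:
        filled["total"] = filled["item_a"] + filled["item_b"]
    return filled


def repair(draft: Draft, max_rounds: int = 4) -> Draft:
    for _ in range(max_rounds):
        bad_fields = validate(draft)
        if not bad_fields:
            return draft
        # Remask only the failing fields; everything else stays as context.
        remasked = {k: (None if k in bad_fields else v) for k, v in draft.items()}
        draft = toy_denoise(remasked)
    return draft


first_draft: Draft = {"item_a": 40, "item_b": 2, "total": 50}
fixed = repair(first_draft)
print(fixed)
```

The fields that passed validation are never regenerated, which is the practical difference from asking an AR model to retry the whole object.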

5. They Separate "Knowing" From "Committing"

This may be the deepest advantage.

AR models must commit in order. DLMs can know that some regions are uncertain and delay committing them. That lets easy tokens crystallize early while hard tokens wait for more context.

That sounds small, but it changes the texture of generation. It is closer to drafting than speaking. Humans often write this way: rough structure first, fill details later, revise the inconsistent parts. DLMs make that workflow architectural instead of procedural.

---

The Honest Tradeoffs

DLMs are promising, not magical.

Streaming Is Worse

AR models are excellent at streaming because the first token appears immediately. DLMs often need a whole block to stabilize before showing useful text. Block diffusion can soften this, but AR still owns the classic chat-streaming experience.

Sampling Is More Complicated

AR decoding has familiar controls: temperature, top-p, top-k, repetition penalties, beam search. DLM sampling has more knobs: number of denoising steps, masking schedules, confidence thresholds, remasking strategies, block sizes.

That complexity is not fatal, but it is real.

KV Cache Is a Huge AR Advantage

Cached AR inference is highly optimized. A full-sequence diffusion pass attends over the whole block repeatedly, so fewer sequential steps do not automatically mean proportional latency gains. Block diffusion narrows this gap by making the completed prefix cacheable, but within-block passes remain heavier than cached AR steps.

The DLM speed case depends on implementation, sequence length, hardware utilization, and how many tokens can safely be accepted per pass.
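As a rough illustration of why fewer steps do not translate directly into less work, the sketch below counts operations for the 512-token, 64-pass example used earlier. The two cost functions are simplifying assumptions: they count token-level forward computations and query-key attention pairs only, ignore prompt prefill, and assume every diffusion pass recomputes the full sequence with no caching. Raw counts are not latency, since the diffusion work within a pass runs in parallel.

```python
from typing import Dict


def ar_cached_cost(prompt_len: int, new_tokens: int) -> Dict[str, int]:
    """Cached AR decoding: one new token per step, attending to all cached keys."""
    attention_pairs = sum(prompt_len + i for i in range(1, new_tokens + 1))
    return {
        "sequential_steps": new_tokens,
        "token_forwards": new_tokens,
        "attention_pairs": attention_pairs,
    }


def full_diffusion_cost(prompt_len: int, new_tokens: int, passes: int) -> Dict[str, int]:
    """Full-sequence diffusion: every pass recomputes every position."""
    seq_len = prompt_len + new_tokens
    return {
        "sequential_steps": passes,
        "token_forwards": passes * seq_len,
        "attention_pairs": passes * seq_len * seq_len,
    }


ar = ar_cached_cost(prompt_len=0, new_tokens=512)
dlm = full_diffusion_cost(prompt_len=0, new_tokens=512, passes=64)
print("AR :", ar)
print("DLM:", dlm)
```

Under these assumptions the diffusion run takes 8 times fewer sequential steps but performs 64 times more token-level forward computations, so any speedup has to come from the hardware doing that extra work in parallel.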

Scaling Evidence Still Favors AR

The strongest language model scaling evidence belongs to autoregressive models. The biggest production systems, the most polished inference stacks, and the deepest ecosystem are AR-first.

DLMs have to earn their place.

But the important thing is that the tradeoff is no longer silly. It is not "AR works and diffusion does not." It is:

```text
AR is better for streaming continuation.
DLMs may be better for parallel, structured, editable generation.
```

That second category is not niche. It includes a lot of what people actually want LLMs to do.

---

Where This Leaves Us

Autoregressive models won the first era of LLMs because next-token prediction is simple, scalable, and brutally effective. Nothing in the diffusion story changes that. What has changed is that the left-to-right decoding order — the one piece of the stack everyone forgot was optional — now has a credible alternative, and not just on paper. LLaDA proved the training recipe works at scale, block diffusion gave it a practical inference story, and models like Mercury, Gemini Diffusion, and DiffusionGemma are carrying it toward production.

The case for DLMs is not that they are universally better. It is narrower and, I think, stronger: they are architecturally aligned with outputs that need global consistency, parallel construction, infilling, and revision. AR treats language as a stream; diffusion treats it as an object under repair. For chat, the stream is often enough. For tools, code, reports, plans, and structured reasoning, the object may turn out to be the better abstraction — and that is the question the rest of this series will try to answer empirically.

---

References

  • Ho, Jain & Abbeel, *Denoising Diffusion Probabilistic Models* (2020) — arXiv:2006.11239
  • Song, Meng & Ermon, *Denoising Diffusion Implicit Models* (2021) — arXiv:2010.02502
  • Austin et al., *Structured Denoising Diffusion Models in Discrete State-Spaces* (2021) — arXiv:2107.03006
  • Sahoo et al., *Simple and Effective Masked Diffusion Language Models* (MDLM, 2024) — arXiv:2406.07524
  • Nie et al., *Large Language Diffusion Models* (LLaDA, 2025) — arXiv:2502.09992
  • Arriola et al., *Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models* (2025) — arXiv:2503.09573
  • *LLaDA2.0 / LLaDA2.0-mini-CAP* — block diffusion + confidence-aware parallel training