Broad Review of DLM Architectures

June 25, 2026

*Part 2 of the Diffusion Language Models series.*

---

Once we understand that a model can denoise a sequence instead of extending it, a surprising number of design questions open up that autoregression never had to answer:

How much of the sequence do you corrupt at once: all of it, or a block at a time?

Do you train from scratch, or adapt an existing autoregressive model?

How do you decide *which* positions to commit on each pass?

And when you scale, do you make the dense transformer bigger, or do you go sparse with a mixture of experts?

The cleanest way to see these tradeoffs is to put four points on the map: Dream, LLaDA, block diffusion, and the MoE turn taken by LLaDA 2.0 and DiffusionGemma. They are not four flavors of the same thing. They sit at genuinely different places in the same design space.

---

The Axis That Matters: How Much Do You Mask At Once?

Forget the marketing for a moment. The single most important architectural choice in a diffusion LM is the granularity of corruption, and it lives on a spectrum:

```text
fully masked <-------------------------------------> fully autoregressive
(whole sequence           block-by-block             (one token,
 denoised together)       denoising                   left to right)
LLaDA, Dream              block diffusion            GPT / Llama / classic Gemma
```

At the left edge, the entire output region is masked and refined together over many global passes. At the right edge, you have ordinary next-token prediction, which is just diffusion with a block size of one and a fixed left-to-right schedule. Everything interesting in diffusion LMs is a position *between* those two poles.

So the real question for any DLM is: where on this axis does it sit, and what did it trade to get there?

---

LLaDA: Fully Masked, Any-Order Denoising

LLaDA (*Large Language Diffusion with mAsking*) is the clean reference point for the left edge of the axis.

The corruption process is deliberately simple. Take a clean sequence, sample a masking ratio $t \in (0, 1]$, and independently replace each token with [MASK] with probability $t$. At $t \approx 1$ almost everything is masked; at $t \approx 0$ almost nothing is. The model is trained to predict the original tokens at every masked position, with a loss that is averaged over the random masking ratio:

```text
clean:    The patient reports chest pain with exertion .
t = 0.5:  The [MASK] reports [MASK] pain [MASK] exertion [MASK]
target:   patient, chest, with, .
```
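The corruption step is small enough to sketch directly. The following is a minimal illustration, not LLaDA's actual code: the names `corrupt` and `MASK` are hypothetical, tokens are whitespace-split words rather than real tokenizer ids, and the loss itself is left out.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; a real model uses a reserved token id


def corrupt(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    noisy, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < t:
            noisy.append(MASK)
            targets[i] = tok  # the model is scored only at masked positions
        else:
            noisy.append(tok)
    return noisy, targets


rng = random.Random(0)
clean = "The patient reports chest pain with exertion .".split()
t = 1.0 - rng.random()  # rng.random() lies in [0, 1), so t lies in (0, 1]
noisy, targets = corrupt(clean, t, rng)
print(f"t = {t:.2f}:", " ".join(noisy))
print("targets:", targets)
```

Because each position is masked independently, the number of masked tokens at a given $t$ varies from sample to sample; only its expectation is $t$ times the sequence length.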

Two properties fall out of this design and they define LLaDA's character.

It is bidirectional and order-free. The model attends over the whole sequence with no causal mask, and because the mask is sampled independently per position, there is no privileged generation order. The model is, in effect, trained to fill any subset of positions given any other subset. That is exactly the infilling-is-native story from Part 1, baked into the objective rather than bolted on.

Generation is a global remasking loop. At inference, you start from an all-masked output region and run a fixed number of denoising steps. Each step the model predicts every masked position at once; you *keep* the most confident predictions and *remask* the rest, so the sequence sharpens over successive passes:

```text
pass 1: [MASK] [MASK] reports [MASK] [MASK] [MASK] [MASK] [MASK]
pass 2: The [MASK] reports cough [MASK] fever [MASK] [MASK]
pass 3: The patient reports cough and fever today .
```
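The loop can be sketched in a few lines. This is a toy under loud assumptions: `toy_denoiser` is a hypothetical stand-in that cheats by reading the reference answer and attaching a random confidence, and the "commit an equal share per step" schedule is one simple choice, not necessarily the one LLaDA uses. Only the keep-confident, remask-the-rest control flow is the point.

```python
import random

MASK = "[MASK]"


def toy_denoiser(seq, reference, rng):
    """Hypothetical model stand-in: for each masked position, return
    (predicted_token, confidence). It cheats by copying the reference."""
    return {i: (reference[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}


def generate(reference, steps, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * len(reference)  # fixed-size, all-masked canvas
    trace = []
    for step in range(steps):
        proposals = toy_denoiser(seq, reference, rng)
        if not proposals:
            break
        # Commit enough positions that the canvas is full by the last step.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = proposals[i][0]  # keep the most confident predictions
        # Positions that were not committed stay MASK: they are "remasked".
        trace.append(" ".join(seq))
    return seq, trace


reference = "The patient reports cough and fever today .".split()
final, trace = generate(reference, steps=3)
for n, line in enumerate(trace, start=1):
    print(f"pass {n}: {line}")
```

Changing `steps` changes how many positions are committed per pass without touching the canvas length, which is the compute-versus-length dial in miniature.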

The number of steps is a dial you control independently of sequence length. That decoupling, of compute from length, is the whole appeal. The cost is that LLaDA has no notion of "finish the start before the end." It is committing to a fixed-size canvas and polishing all of it, which makes streaming awkward and makes length handling something the model has to learn rather than something the decoding loop gives you for free.

LLaDA's contribution was less a new idea than a *demonstration*: a from-scratch masked diffusion model, trained at 8B scale, that is genuinely competitive with autoregressive models of similar size on standard benchmarks. It made the left edge of the axis look like a real place to build, not just a toy.

---

Dream: Same Edge, But Adapted From an Autoregressive Model

Dream (Dream 7B) lives at roughly the same masked-diffusion edge as LLaDA, but it answers a different question: do you have to train a diffusion LM from scratch?

The expensive thing about LLaDA is that all those trillions of pretraining tokens were spent learning a *new* objective. Dream's bet is that most of that knowledge already exists, in good autoregressive checkpoints, and can be ported. Dream is initialized from autoregressive weights (the Qwen family) and then *adapted* into a masked diffusion model, rather than pretrained on the diffusion objective from zero.

This reframes the architectural relationship between AR and diffusion. They are not two species; they are two *objectives over the same transformer*, and you can move a set of weights from one to the other if you are careful about:

The attention mask. Causal during the AR phase, bidirectional during the diffusion phase. The weights survive the switch better than you'd expect, because attention patterns are largely re-learnable.

The token-shift mismatch. AR models predict position $i{+}1$ from position $i$; masked diffusion predicts position $i$ in place. Reconciling that offset is part of what the adaptation has to fix.

The noise schedule used during adaptation, which controls how aggressively the model is pushed away from its left-to-right prior.
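The first two items can be made concrete with a small sketch. It only illustrates the mismatch; it is not Dream's adaptation recipe, and all function names here are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def ar_targets(tokens):
    """Next-token prediction: the output at position i is scored against token i+1."""
    return [(i, tokens[i + 1]) for i in range(len(tokens) - 1)]


def diffusion_targets(tokens, masked):
    """Masked diffusion: the output at masked position i is scored against token i."""
    return [(i, tokens[i]) for i in sorted(masked)]


tokens = ["The", "patient", "reports", "cough"]
print(ar_targets(tokens))
print(diffusion_targets(tokens, masked={1, 3}))
```

In the AR pairing, the output slot at position 0 is responsible for "patient"; in the diffusion pairing, "patient" belongs to position 1 itself. A converted checkpoint has to either relearn that alignment or keep the shifted readout, and the sketch takes no position on which is better.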

The practical upshot: Dream reaches strong quality for a fraction of the from-scratch compute, and it inherits the maturity of the AR ecosystem it was born from. Dream and LLaDA end up at similar points on the masking axis, but by opposite roads, one paved from scratch, one paved by conversion. That "adapt, don't retrain" line is worth holding onto, because it is exactly how the broader field is likely to get diffusion variants of models it already has.

| | LLaDA | Dream |
|---|---|---|
| Position on axis | Fully masked | Fully masked |
| Origin | Trained from scratch on the diffusion objective | Adapted from an autoregressive checkpoint |
| Attention | Bidirectional | Bidirectional (re-learned from causal) |
| Main claim | Diffusion can scale on its own | You can convert AR knowledge cheaply |
| Cost paid | Full pretraining compute | Adaptation + reconciling the AR/diffusion mismatch |

---

Block Diffusion: Buying Back the Stream

Both LLaDA and Dream pay the same tax: because the whole sequence is denoised together, there is no natural "first part is done, start showing it" moment, and no KV cache, since bidirectional attention over a changing sequence cannot be cached the way causal attention can.

Block diffusion (the BD3-LM line of work) is the explicit compromise on the axis. Split the sequence into blocks. Generate autoregressively across blocks but diffusively within each block:

```text
Block 1: [denoise these tokens in parallel]                   --> commit
Block 2: [denoise in parallel, conditioned on block 1]        --> commit
Block 3: [denoise in parallel, conditioned on blocks 1-2]     --> commit
```

This is the interpolation point, and it recovers most of what the pure-masked edge gave up:

Streaming comes back, because completed blocks can be emitted while later blocks are still masked.

The KV cache comes back, because attention *across* blocks is causal and cacheable; only the within-block attention is bidirectional and recomputed.

Length handling gets easier, because you can stop generating blocks when the content is done instead of committing to a fixed canvas up front.
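Streaming and early stopping both fall out of the outer loop, which a short sketch makes visible. As before, this is a toy: `denoise_block` is a hypothetical stand-in that copies from a reference sequence, the `<eos>` and `<pad>` symbols are assumptions, and a real model would actually condition on the committed context rather than ignore it.

```python
import random

MASK, EOS = "[MASK]", "<eos>"


def denoise_block(context, reference_block, steps, rng):
    """Hypothetical within-block denoiser. A real model would attend to
    `context` (the committed blocks); this toy ignores it and copies."""
    block = [MASK] * len(reference_block)
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        k = -(-len(masked) // (steps - step))  # ceiling division
        for i in rng.sample(masked, k):
            block[i] = reference_block[i]
    return block


def generate_blockwise(reference, block_size, steps_per_block, seed=0):
    rng = random.Random(seed)
    committed = []
    for start in range(0, len(reference), block_size):
        target = reference[start:start + block_size]
        block = denoise_block(committed, target, steps_per_block, rng)
        committed.extend(block)  # committed blocks are never revised
        yield block              # streaming: emit as soon as a block is done
        if EOS in block:         # length handling: stop when content ends
            break


reference = ("The patient reports cough and fever today . "
             "<eos> <pad> <pad> <pad> <pad>").split()
for n, block in enumerate(generate_blockwise(reference, 3, 2), start=1):
    print(f"block {n}: {' '.join(block)}")
```

The padding after `<eos>` is never generated, because the loop stops at the first block that contains the end marker instead of filling a canvas sized in advance.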

What you give up is global order-freedom: a token in block 1 can no longer be revised in light of block 3, because block 1 was committed first. You have re-introduced a coarse left-to-right dependency, just at the block level instead of the token level. That is the whole point. Block size becomes the knob that slides you along the axis: block size of one is pure AR, block size of "everything" is pure LLaDA, and the interesting regime is in between.
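The block-size knob is easiest to see in the attention mask. The sketch below builds a block-causal mask for inference under the description above (bidirectional inside a block, causal across blocks); the function names are hypothetical, and training-time masks in real block diffusion systems are more involved than this.

```python
def block_attention_mask(n, block_size):
    """mask[i][j] is True when position i may attend to position j.
    A position sees its whole own block and every earlier block."""
    return [[j // block_size <= i // block_size for j in range(n)]
            for i in range(n)]


def render(mask):
    """Draw the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


for size in (1, 2, 4):
    print(f"block_size = {size}")
    print(render(block_attention_mask(4, size)))
    print()
```

With `block_size = 1` the mask is the ordinary lower-triangular causal mask, and with `block_size` equal to the sequence length every entry is allowed, which matches the two ends of the axis.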

Block diffusion matters less as a single model and more as the *mechanism* that later production systems reach for when they need diffusion's parallelism without losing the inference tricks that make AR cheap to serve.

---

The MoE Turn: LLaDA 2.0 / 2.1 and DiffusionGemma

Everything above is about the *decoding contract*. The most recent move is orthogonal to it, and it is about where the parameters live.

So far the implicit assumption was a dense transformer: every token passes through every parameter on every pass. Diffusion makes that assumption more painful than usual, because a diffusion LM does *many* forward passes over the sequence to produce one output. If each of those passes lights up the entire dense network, the parallelism win on the *position* axis is partly eaten by repeated full-network compute on the *step* axis.
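A back-of-the-envelope count shows the size of the problem. This sketch uses the common rule of thumb of roughly 2 FLOPs per active parameter per token processed, ignores attention cost, and assumes the diffusion model has no cache at all; the parameter counts, length, and step count are made-up illustration values, and FLOPs are not the same thing as latency.

```python
def forward_flops(active_params, tokens):
    """Rule of thumb: about 2 FLOPs per active parameter per token processed."""
    return 2 * active_params * tokens


def ar_generation_flops(params, length):
    # With a KV cache, each new token costs one single-token forward pass.
    return length * forward_flops(params, 1)


def diffusion_generation_flops(active_params, length, steps):
    # Without a cache, every denoising step reprocesses the whole canvas.
    return steps * forward_flops(active_params, length)


length, steps = 256, 64                  # hypothetical canvas size and step count
dense_params = 8_000_000_000             # hypothetical dense model
sparse_active_params = 1_000_000_000     # hypothetical MoE active parameters

ar = ar_generation_flops(dense_params, length)
dense = diffusion_generation_flops(dense_params, length, steps)
sparse = diffusion_generation_flops(sparse_active_params, length, steps)
print(f"dense diffusion / AR:  {dense / ar:.0f}x")
print(f"sparse diffusion / AR: {sparse / ar:.0f}x")
```

Under these assumptions the dense diffusion model spends as many times more FLOPs than the cached AR model as it takes steps, and shrinking the active parameter count scales that multiplier down proportionally.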

Mixture-of-experts is the natural answer. Replace the dense feed-forward blocks with many expert sub-networks and a router that sends each token to only a few of them. You get a large *total* parameter count, and with it a large store of knowledge, but a small *active* parameter count per token per pass.

That combination is unusually well matched to diffusion:

A diffusion LM already amortizes generation over multiple passes. Making each pass cheaper (sparse activation) directly attacks the part of the cost story that critics point at.

Different denoising stages (early "rough structure" passes versus late "polish" passes) plausibly want different computation, and a router gives the model room to specialize without paying for all experts every time.
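The routing step itself is simple. The sketch below shows one common top-k variant (softmax over router scores, keep the top k, renormalize); it is an illustration, not the router of any specific model named here. The experts are toy scalar functions, and the router logits are passed in by hand, whereas a real router computes them from the token's hidden state with a learned layer.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)
    return {e: probs[e] / norm for e in top}


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route(router_logits, k)
    y = sum(weight * experts[e](x) for e, weight in gates.items())
    return y, sorted(gates)


# Four toy "experts": each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y, used = moe_layer(10.0, experts, router_logits=[0.1, 2.0, -1.0, 1.5], k=2)
print("experts used:", used)
print("output:", y)
```

Only two of the four experts run for this token, so the active parameter count is half the total here; production MoE models typically use many more experts and a much smaller active fraction.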

This is the line LLaDA 2.0 (and the 2.1 / mini-CAP variants) takes: a block-diffusion-style masked model whose backbone is a sparse MoE rather than a dense stack, paired with confidence-aware parallel decoding that decides how many tokens to safely accept per pass. The "CAP" idea, confidence-aware parallelism, is the natural partner to MoE here: sparse experts make each pass cheap, and confidence-based acceptance makes each pass *commit more*, so you need fewer of them.
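Confidence-based acceptance can be sketched as a threshold rule. This is a generic illustration of the idea, not LLaDA 2.0's actual procedure: the function name, the threshold value, and the hand-written confidences are all hypothetical.

```python
MASK = "[MASK]"


def accept_confident(seq, proposals, threshold):
    """Commit every masked position whose confidence clears the threshold.
    Always commit at least the single most confident one, so the decoding
    loop is guaranteed to make progress on every pass."""
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    if not masked:
        return list(seq), 0
    chosen = [i for i in masked if proposals[i][1] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: proposals[i][1])]
    out = list(seq)
    for i in chosen:
        out[i] = proposals[i][0]
    return out, len(chosen)


seq = [MASK] * 4
# position -> (predicted token, confidence); values invented for illustration
proposals = {0: ("The", 0.97), 1: ("patient", 0.55),
             2: ("reports", 0.92), 3: ("cough", 0.40)}

print(accept_confident(seq, proposals, threshold=0.90))
print(accept_confident(seq, proposals, threshold=0.99))
```

At a threshold of 0.90 two positions are committed in one pass; at 0.99 nothing clears the bar, so the fallback commits only the most confident position. Unlike a fixed tokens-per-step schedule, the number of passes now depends on how sure the model is.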

DiffusionGemma sits in the same conceptual neighborhood from the other direction: a member of a well-known, ecosystem-heavy model family stepping onto the diffusion side of the axis. Whatever the exact internals, its significance is the same as Dream's was: it signals that diffusion decoding is moving from research curiosity toward the part of the stack where people ship things.

So the recent frontier is really two independent choices stacked on top of each other:

```text
Choice 1 (decoding):    fully masked --> block diffusion --> AR
Choice 2 (parameters):  dense --> mixture of experts
```

LLaDA and Dream explored choice 1 with a dense backbone. LLaDA 2.0/2.1 and the DiffusionGemma-era models keep moving along choice 1 (toward blockwise) while *also* turning the dial on choice 2 (toward sparse). They are not competing answers to one question. They are progress on two questions at once.

---

The Map, In One Table

| Model / idea | Masking granularity | Backbone | Training origin | Headline tradeoff |
|---|---|---|---|---|
| Classic Gemma / Llama (AR) | One token, left-to-right | Dense | From scratch | Streams perfectly, cannot revise |
| LLaDA | Fully masked | Dense | From scratch | Order-free infilling, no streaming/KV cache |
| Dream | Fully masked | Dense | Adapted from AR (Qwen) | Cheap to obtain, inherits AR's prior |
| Block diffusion (BD3-LM) | Block-by-block | Dense | From scratch | Buys back streaming + KV cache, loses global revision |
| LLaDA 2.0 / 2.1 | Block-by-block | MoE | From scratch | Cheaper passes via sparsity + confidence-aware acceptance |
| DiffusionGemma-era | Block / masked | Dense or MoE | Family-derived | Diffusion entering the mainstream ecosystem |

The thing to notice is that the table has *two* axes doing the work, granularity and parameter layout, and almost none of the interesting models are at a corner. The design space is the point. Autoregression occupies exactly one cell of it. Diffusion opened up the rest.

---

Where This Leaves Us

Part 1 made the case that diffusion treats language as an object under repair. This post is the consequence of taking that seriously: once decoding is repair rather than continuation, you inherit a whole family of design decisions that AR collapsed into a single default.

How much you mask at once (LLaDA/Dream's full masking vs. block diffusion) trades global revisability against streaming and caching.

How you obtain the weights (LLaDA from scratch vs. Dream by adaptation) trades compute against ecosystem inheritance.

How you lay out parameters (dense vs. MoE in LLaDA 2.0/2.1 and the DiffusionGemma era) trades total knowledge against per-pass cost.

None of these has a settled winner, and that is the healthy part. Autoregression won the first era by being a single, brutally effective point in design space. Diffusion's contribution may turn out to be the *space itself*: a set of dials (masking granularity, training origin, sparsity, confidence thresholds) that let a model be tuned to the shape of the task instead of forcing every task through a left-to-right stream.

The next post in the series gets concrete: small, laptop-runnable experiments that make these dials visible, so you can watch a masked sequence sharpen, see what a block boundary actually buys, and feel where each architecture earns its tradeoff.

---

References

*LLaDA: Large Language Diffusion with mAsking* (2025)

*Dream 7B*: diffusion LM adapted from autoregressive (Qwen) weights (2025)

Arriola et al., *Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models* (BD3-LM, 2025)

*LLaDA 2.0 / LLaDA 2.1 / LLaDA2.0-mini-CAP*: block masked diffusion with mixture-of-experts backbone and confidence-aware parallel decoding

Austin et al., *Structured Denoising Diffusion Models in Discrete State-Spaces* (2021)

Shi et al., *Simplified and Generalized Masked Diffusion for Discrete Data* (2024); Sahoo et al., *Simple and Effective Masked Diffusion Language Models* (MDLM, 2024)

Shazeer et al., *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer* (2017)