Why Diffusion LLM Quantization Is Harder Than It Looks

June 25, 2026

A diffusion language model exposes the same modules you have quantized a hundred times: attention and MLP projections, embeddings, a few norms, an output head [5, 6]. So the default move is to reuse the autoregressive (AR) toolbox — GPTQ for the weights, AWQ, SmoothQuant, or QuaRot once activations also go low-bit [8, 9, 10, 11] — calibrate on a few hundred sequences of text, and quantize.

It does not transfer. AWQ on LLaDA-8B at W4A4 loses more than 16 accuracy points [1], with no architectural warning: identical layer shapes, identical kernels. The reason is that quantization error is governed not by the modules but by the distribution of the tensors that flow through them — and AR and diffusion inference induce very different distributions over identically shaped layers.

What AR quantization relies on

Every PTQ method reduces to approximating one matmul. A linear layer computes $Y = XW$ with $X\in\mathbb R^{n\times d_\text{in}}$ and $W\in\mathbb R^{d_\text{in}\times d_\text{out}}$, and quantization replaces it with $\hat Y = Q_X(X)\,Q_W(W)$. The workhorse is the uniform quantizer,

$ Q_s(x) = s\cdot\operatorname{clip}\!\Big(\big\lfloor x/s\big\rceil,\; q_{\min},\; q_{\max}\Big), \qquad [q_{\min},q_{\max}] = [-2^{b-1},\, 2^{b-1}-1], $

so signed INT4 lands on the grid $[-8, 7]$ and the only freedom is the scale $s$ (plus a zero point $z$ for asymmetric activations). Each method is a different way of choosing those scales to solve the same layer-reconstruction problem,

$ \min_{\hat W}\ \big\|XW - X\hat W\big\|_F^2. $
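As a minimal sketch of the two formulas above, the snippet below implements the symmetric uniform quantizer and evaluates the layer-reconstruction residual on random data. The helper names `quantize` and `absmax_scale` are illustrative, and the abs-max scale is just one simple choice (an assumption here), not the rule any particular method uses.

```python
import numpy as np

def quantize(x, scale, bits=4):
    """Symmetric uniform quantizer: s * clip(round(x / s), q_min, q_max)."""
    q_min, q_max = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    return scale * np.clip(np.round(x / scale), q_min, q_max)

def absmax_scale(x, bits=4):
    """One simple scale choice (an assumption here): map max |x| onto q_max."""
    return np.abs(x).max() / (2 ** (bits - 1) - 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))      # n x d_in calibration activations
W = rng.normal(size=(32, 16))      # d_in x d_out weights

scale = absmax_scale(W)
W_hat = quantize(W, scale)
recon_error = np.linalg.norm(X @ W - X @ W_hat) ** 2   # ||XW - X W_hat||_F^2
levels = np.unique(np.round(W_hat / scale))            # integer grid points in use
```

Every entry of `levels` falls inside the signed INT4 grid, and `recon_error` is the quantity each method in this section tries to shrink by choosing the scale more carefully.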

GPTQ minimizes it one input channel at a time (a column in GPTQ's own $WX$ layout, a row of $W$ here) using the OBQ Hessian $H = X^\top X$: after rounding the weight $w_q$ at input index $q$, it pushes the induced error into the not-yet-quantized weights,

$ \delta = -\,\frac{w_q - Q(w_q)}{\big[H^{-1}\big]_{qq}}\,H^{-1}_{:,q}, $

which is the only place the calibration activations $X$ enter [8]. AWQ rescales channels by activation magnitude [9]; SmoothQuant migrates activation outliers into the weights via a diagonal reparametrization, $X\!\to\!X\operatorname{diag}(s)^{-1}$ and $W\!\to\!\operatorname{diag}(s)\,W$, leaving $XW$ unchanged [10]; QuaRot inserts an orthogonal $R$ with $R^\top R = I$ so that $XW = (XR)(R^\top W)$ is exact while the Hadamard rotation spreads the outliers in $XR$ [11].
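The two exact reparametrizations can be checked numerically. The sketch below is a toy with one synthetic outlier channel; the `hadamard` helper is written out here for self-containment, and the square-root smoothing factor (migration strength 0.5) is one common setting, assumed for illustration.

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X[:, 3] *= 50.0                          # one synthetic outlier channel
W = rng.normal(size=(16, 4))

# SmoothQuant-style diagonal reparametrization with migration strength 0.5.
s = np.sqrt(np.abs(X).max(axis=0) / np.abs(W).max(axis=1))
X_s, W_s = X / s, W * s[:, None]         # X diag(s)^-1 and diag(s) W

# QuaRot-style rotation with an orthogonal Hadamard matrix.
R = hadamard(16)
X_r, W_r = X @ R, R.T @ W
```

Both `X_s @ W_s` and `X_r @ W_r` reproduce `X @ W` up to floating-point error, while the largest activation magnitude drops in each case. That drop is what makes the activations easier to place on a low-bit grid.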

All of it rests on one assumption: $X_\text{calib}\approx X_\text{infer}$, so that the empirical objective $\min_s \mathbb E_{X\sim p_\text{calib}}\|XW - Q_s(X)Q(W)\|^2$ is a faithful proxy for deployment. For AR models it holds because inference is a stationary causal process, $x_t\sim p_\theta(\cdot\mid x_{<t})$: each decoding step sees the previous step's activations on a sequence one token longer. The marginal distribution a layer sees is essentially fixed, so a frozen calibration set samples it well.

The diffusion inference loop

A masked diffusion LM does not extend a prefix. It initializes all $L$ positions to a mask token $[M]$ and iterates a denoiser $p_\theta(x_0\mid x_t)$ over steps $t = T,\dots,1$ [5, 6]. The per-position state is

$ x_t^i = \begin{cases} [M], & i \text{ masked at step } t,\\ x_0^i, & i \text{ already committed}, \end{cases} $

the forward process masks each token independently with probability given by the noise level ($t/T\in[0,1]$ for the integer step index used here), and the mask ratio $\rho_t = \tfrac{1}{L}\sum_i \mathbf 1\big[x_t^i = [M]\big]$ falls as decoding proceeds. Each step scores all masked positions, commits the highest-confidence subset, and re-masks the rest; block-diffusion models such as LLaDA2 add a block index $b$ selecting the active region [7].
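The score, commit, re-mask loop can be sketched in a few lines. This is a toy under stated assumptions: `toy_denoiser` is a hypothetical stand-in that returns random probabilities instead of a trained $p_\theta$, `MASK = -1` is a placeholder id for $[M]$, and committing a fixed number of positions per step is one simple policy, not the schedule of any specific model.

```python
import numpy as np

MASK = -1  # hypothetical id standing in for the [M] token

def toy_denoiser(x, vocab, rng):
    """Stand-in for p_theta(x_0 | x_t): random per-position token probabilities."""
    logits = rng.normal(size=(len(x), vocab))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

def decode(length=16, steps=4, vocab=50, seed=0):
    rng = np.random.default_rng(seed)
    x = np.full(length, MASK)
    per_step = length // steps               # positions committed per step
    mask_ratios = []
    for _ in range(steps):
        mask_ratios.append(float(np.mean(x == MASK)))   # rho_t before this step
        probs = toy_denoiser(x, vocab, rng)
        conf, pred = probs.max(axis=1), probs.argmax(axis=1)
        conf[x != MASK] = -np.inf            # committed positions stay frozen
        commit = np.argsort(-conf)[:per_step]
        x[commit] = pred[commit]             # every other position stays masked
    return x, mask_ratios

tokens, ratios = decode()
```

With the defaults the recorded mask ratios step down from 1.0 to 0.25, so every layer is evaluated on four visibly different input states within a single generation.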

There is therefore no single activation distribution. A layer $\ell$ sees a family

$ X_\ell\big(t,\rho,b,\pi\big), $

indexed by step $t$, mask ratio $\rho$, active block $b$, and commit/remask policy $\pi$. The AR proxy collapses: fully visible calibration text concentrates near $\rho\approx 0$, while inference spends most of its steps at large $\rho$. The honest calibration objective is an expectation over the schedule,

$ \min_s\ \mathbb E_{t}\,\mathbb E_{X\sim p_t}\big\|XW - Q_s(X)\,Q(W)\big\|^2, $

and fitting the $\rho\approx 0$ slice alone optimizes the wrong distribution. Four consequences follow.
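A toy Monte Carlo version of that expectation makes the mismatch concrete. Everything here is synthetic: the `activations` helper assumes, purely for illustration, that the activation spread grows with the mask ratio, and the weights are left unquantized to isolate the activation term. It is not a measurement of any real model.

```python
import numpy as np

def quantize(x, scale, bits=4):
    q_min, q_max = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    return scale * np.clip(np.round(x / scale), q_min, q_max)

def activations(rho, rng, n=256, d=32):
    """Synthetic X(rho): spread grows with mask ratio (an illustrative assumption)."""
    return rng.normal(scale=1.0 + 7.0 * rho, size=(n, d))

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))
schedule = np.linspace(1.0, 0.0, 9)          # mask ratios visited while decoding
states = [activations(rho, rng) for rho in schedule]

def schedule_error(scale):
    """Estimate E_t E_{X~p_t} ||XW - Q_s(X) W||^2 with exact weights."""
    return np.mean([np.linalg.norm(X @ W - quantize(X, scale) @ W) ** 2
                    for X in states])

q_max = 7
scale_ar = np.abs(states[-1]).max() / q_max                  # fit on the rho ~ 0 slice only
scale_sched = max(np.abs(X).max() for X in states) / q_max   # fit over the whole schedule

err_ar, err_sched = schedule_error(scale_ar), schedule_error(scale_sched)
```

In this toy the slice-fitted scale clips most of the high-mask-ratio states, so `err_ar` comes out well above `err_sched`. The size of the gap depends entirely on the assumed spread, but the direction is the point.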

Where the assumptions break

1. Calibration covers the wrong support. Empirically, the activation range of LLaDA's first block drifts monotonically across $t$ — adjacent steps similar, distant ones nearly disjoint [1] — so a single $s$ chosen at $\rho\approx 0$ cannot cover the high-$\rho$ regimes the model actually occupies.

2. Error compounds along the trajectory. In a feed-forward pass quantization error is local. In diffusion the step output is the next step's input, $x_{t-1} = g\big(x_t, \hat f_\theta(x_t)\big)$, so with $L(x_t)$ the accumulated error at step $t$, DLLMQuant's recurrence is

$ L(x_t) = x_t - \operatorname{Deq}\!\big(Q(x_t + L(x_{t+1}))\big). $

Linearizing, $L(x_t)\approx \epsilon_t + J_g\,L(x_{t+1})$ with per-step quantization noise $\epsilon_t$ and Jacobian $J_g = \partial g/\partial x$, hence

$ \big\|L(x_t)\big\| \;\lesssim\; \sum_{\tau \ge t} \|J_g\|^{\,\tau - t}\,\|\epsilon_\tau\|, $

a sum that grows geometrically with the number of remaining steps whenever $\|J_g\| > 1$, rather than a one-shot perturbation [1]. DLLMQuant localizes most of $\epsilon_t$ to a single op, the $\operatorname{softmax}(QK^\top/\sqrt d)\,V$ product; holding that matmul in high precision flattens the curve. It also separates the two regimes: with FP16 activations $\hat Y = X\,Q_W(W)$ injects weight error once, whereas $\hat Y = Q_X(X)\,Q_W(W)$ injects activation error on each of $T$ passes, which is why W4A16 is stable and W4A4 is a cliff.

3. Attention reweights activation error. With $A = \operatorname{softmax}(QK^\top/\sqrt d)$ and $O_i = \sum_j A_{ij} V_j$, quantizing the values gives

$ \Delta O_i = \sum_j A_{ij}\big(Q(V_j) - V_j\big), \qquad \|\Delta O_i\| \le \sum_j A_{ij}\,\|Q(V_j) - V_j\|. $

The output error is the *attention-weighted* value error, not the raw one. In LLaDA $A$ is sharply peaked (mass on the diagonal and a few tokens) while $V$ varies strongly across channels and tokens [1], so an unweighted quantizer that equalizes $\|Q(V_j) - V_j\|$ across tokens spends precision on rows $V_j$ that $A$ multiplies by near-zero weights.

4. The Hessian over-weights inert tokens. GPTQ compensates with $H = X^\top X$, weighting every token row equally. But each row is one of: committed and frozen (error does not propagate), masked and low-confidence (likely re-masked), or masked and high-confidence (about to enter the context every later step reads). Only the last feeds forward, so a uniform $H$ spends its compensation budget on rows with no downstream effect.
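The linearized recurrence from item 2 is easy to unroll numerically. The sketch below is a scalar toy: it replaces $J_g$ and $\epsilon_t$ by their norms, and the equal per-step noise of 0.01 is an assumed value chosen only to make the two regimes visible. Unrolling the recurrence gives a sum that includes step $t$'s own noise, which is what `closed_form` computes.

```python
import numpy as np

def accumulate(noise, jac_norm):
    """Run e_t = eps_t + J * e_{t+1} backward from t = T to t = 1 (scalar toy)."""
    err, trace = 0.0, []
    for eps in reversed(noise):            # noise[k] is the noise norm at step k + 1
        err = eps + jac_norm * err
        trace.append(err)
    return trace[::-1]                     # trace[k] is the error at step k + 1

def closed_form(noise, jac_norm, t):
    """sum_{tau >= t} J^(tau - t) * eps_tau, with 1-based step index t."""
    return sum(jac_norm ** (tau - t) * noise[tau - 1]
               for tau in range(t, len(noise) + 1))

T = 16
noise = [0.01] * T                         # equal per-step quantization noise (assumed)
expanding = accumulate(noise, jac_norm=1.2)
contracting = accumulate(noise, jac_norm=0.8)
```

With a norm of 1.2 the error reaching the final step is far larger than the $T$ noise terms simply added up, while with 0.8 it stays below the geometric-series limit $0.01/(1-0.8)$. Whether a real model sits in the expanding regime is an empirical question the toy cannot answer.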

Schedule-aware fixes

DLLMQuant [1] addresses these four cracks with three modules layered on a GPTQ/QuaRot baseline.

Temporal-Mask Adaptive Sampling (TMAS) replaces text sampling with state sampling. With $B$ blocks and $T$ steps it sets block size $s = \lfloor T/B\rfloor$ (steps per block, not the quantizer scale), classifies each captured state into one of four mask-ratio bins split at $\{0.2, 0.5, 0.8\}$, and fills per-block quotas $p = n\,[0.3, 0.2, 0.2, 0.3]$ for a budget of $n$ samples, over-sampling the $\rho\!\to\!1$ and $\rho\!\to\!0$ extremes where the distribution is most distinct. The calibration set then approximates $\mathbb E_t\,p_t$ rather than $p_{t\approx 0}$, and on its own recovers most of the INT4 drop [1].
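The binning and quota step might look like the following. This is a simplified sketch, not the DLLMQuant implementation: it ignores the per-block loop, `sample_states` is a hypothetical helper, and the evenly spaced `captured` mask ratios stand in for states recorded from real decoding runs.

```python
import numpy as np

BIN_EDGES = np.array([0.2, 0.5, 0.8])           # mask-ratio split points from the text
PROPORTIONS = np.array([0.3, 0.2, 0.2, 0.3])    # target share of each bin

def sample_states(mask_ratios, n, seed=0):
    """Pick up to n state indices so the four mask-ratio bins follow PROPORTIONS."""
    rng = np.random.default_rng(seed)
    mask_ratios = np.asarray(mask_ratios)
    bins = np.digitize(mask_ratios, BIN_EDGES)   # bin id 0..3 per captured state
    quotas = np.round(n * PROPORTIONS).astype(int)
    chosen = []
    for b, quota in enumerate(quotas):
        pool = np.flatnonzero(bins == b)
        take = min(quota, len(pool))             # a sparse bin yields fewer samples
        chosen.extend(rng.choice(pool, size=take, replace=False))
    return np.array(chosen), bins

captured = np.linspace(0.0, 1.0, 200)            # hypothetical captured mask ratios
idx, bins = sample_states(captured, n=40)
counts = np.bincount(bins[idx], minlength=4)
```

For a budget of 40 the selected states split 12, 8, 8, 12 across the four bins, so the two extreme regimes together make up 60% of the calibration set.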

Interaction-Aware Activation Quantization (IA-AQ) quantizes $V$ against the attention-weighted objective from crack 3,

$ \min_s\ \sum_{i,j} A_{ij}\,\big\|V_j - Q_s(V_j)\big\|^2, $

searching the scale on a grid around the default range $\hat s = (V_{\max} - V_{\min})/(q_{\max} - q_{\min})$, with $s = \alpha^\star\hat s$ and $\alpha^\star = \arg\min_{\alpha\in\{1.0,\,0.8\}} L(\alpha\hat s)$, where $L$ here denotes the attention-weighted objective above [1].
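A small sketch of that search, under assumptions: the attention matrix is a random, sharply peaked toy, `quantize_asym` centers its zero point on the value range (one possible convention, not necessarily the paper's), and the candidate set is the two-point grid quoted above. The last two lines also evaluate the attention-weighted error bound from item 3.

```python
import numpy as np

def quantize_asym(v, scale, q_min=-8, q_max=7):
    """Asymmetric uniform quantizer; the zero point centers the grid on v's range."""
    center = 0.5 * (v.max() + v.min())
    zero = np.round(0.5 * (q_min + q_max) - center / scale)
    q = np.clip(np.round(v / scale) + zero, q_min, q_max)
    return scale * (q - zero)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weighted_loss(A, V, V_hat):
    """sum_{i,j} A_ij ||V_j - V_hat_j||^2, using the column mass of A per token."""
    per_token = np.sum((V - V_hat) ** 2, axis=1)
    return float(A.sum(axis=0) @ per_token)

rng = np.random.default_rng(0)
n, d = 32, 16
A = softmax(4.0 * rng.normal(size=(n, n)))       # sharply peaked toy attention
V = rng.normal(size=(n, d))

s_hat = (V.max() - V.min()) / (7 - (-8))         # default range scale
losses = {a: weighted_loss(A, V, quantize_asym(V, a * s_hat)) for a in (1.0, 0.8)}
alpha_star = min(losses, key=losses.get)
V_hat = quantize_asym(V, alpha_star * s_hat)

dO = A @ (V_hat - V)                             # attention-weighted output error
bound = A @ np.linalg.norm(V_hat - V, axis=1)    # sum_j A_ij ||Q(V_j) - V_j||
```

Tokens that receive little attention mass contribute little to `weighted_loss`, so the search is free to clip them harder in exchange for a finer grid on the tokens that matter.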

Certainty-Guided Quantization (CGQ) replaces the flat Hessian with a certainty-weighted one. Define a per-token weight folding in mask state and confidence score $sc_i$,

$ \omega_i = \big(\mathbf 1[x_t^i = [M]]\cdot 1 + \mathbf 1[x_t^i \neq [M]]\cdot 0.7\big) + \sqrt{sc_i}, \qquad \tilde X = \operatorname{diag}(\omega)\,X, $

so that

$ H = \tilde X^\top \tilde X = X^\top \Omega X, \qquad \Omega = \operatorname{diag}(\omega^2), $

concentrating compensation on masked, high-confidence rows — crack 4 [1]. Together the three turn a 16-point loss into a $>$10-point GSM8K gain over baselines, preserve reasoning that plain QuaRot drops, and run $\sim$1.6× faster at $\sim$3.2× less memory [1].
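The certainty-weighted Hessian is a one-line change to the flat one. The sketch below builds $\omega$ from the formula above on random inputs; the mask flags and confidence scores are synthetic placeholders, and `certainty_weights` is an illustrative helper name.

```python
import numpy as np

def certainty_weights(is_masked, confidence):
    """omega_i = (1 if masked else 0.7) + sqrt(confidence_i), per token row."""
    return np.where(is_masked, 1.0, 0.7) + np.sqrt(confidence)

rng = np.random.default_rng(0)
n, d = 12, 6
X = rng.normal(size=(n, d))
is_masked = rng.random(n) < 0.5
confidence = rng.random(n)                  # hypothetical confidence scores in [0, 1)

omega = certainty_weights(is_masked, confidence)
X_tilde = omega[:, None] * X                # diag(omega) X
H_flat = X.T @ X                            # uniform Hessian, every row weighted equally
H_cgq = X_tilde.T @ X_tilde                 # certainty-weighted Hessian
H_direct = X.T @ np.diag(omega ** 2) @ X    # X^T Omega X with Omega = diag(omega^2)
```

`H_cgq` and `H_direct` agree, confirming the identity, and either can be handed to a GPTQ-style solver in place of `H_flat` without touching the rest of the update.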

Quant-dLLM [3] pushes weight-only to 2 bits. It keeps schedule-aware calibration (Masked Calibration Simulation, the MCS analogue of TMAS) but changes the weight code: instead of one 2-bit grid it writes each matrix as an order-$K$ sum of row–column-scaled binary matrices,

$ \hat W = \sum_{i=1}^{K} \operatorname{diag}(u_i)\,B^{(i)}\,\operatorname{diag}(v_i), \qquad B^{(i)}\in\{-1,+1\}^{d_\text{in}\times d_\text{out}}, $

fit to the simulated calibration statistics; $K$ binary planes cost $\approx K$ bits, so $K = 2$ holds the budget while fitting masked activations better than a fixed code [3]. Adaptive Blockwise Mixed Precision (ABMP) then assigns per-block order $K_b$ by a sensitivity score under the average constraint $\frac{1}{B}\sum_b K_b \le 2$, spending extra planes on the blocks that drive late-step error, the bit-budget analogue of CGQ.
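To see why the row-column-scaled binary form is expressive, here is a generic greedy residual fit. It is a weight-only illustration that ignores calibration statistics, so it is not Quant-dLLM's fitting procedure: each plane takes the sign of the current residual and fits the magnitudes with a rank-1 term via SVD.

```python
import numpy as np

def binary_planes(W, K):
    """Greedy residual fit of W ~ sum_i diag(u_i) B_i diag(v_i), B_i in {-1, +1}."""
    residual, planes = W.copy(), []
    for _ in range(K):
        B = np.where(residual >= 0, 1.0, -1.0)
        # diag(u) B diag(v) equals B * outer(u, v), so fit |residual| with rank 1.
        U, S, Vt = np.linalg.svd(np.abs(residual), full_matrices=False)
        u, v = U[:, 0] * S[0], Vt[0]
        planes.append((u, B, v))
        residual = residual - B * np.outer(u, v)
    return planes

def reconstruct(planes):
    return sum(np.diag(u) @ B @ np.diag(v) for u, B, v in planes)

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))
errors = [np.linalg.norm(W - reconstruct(binary_planes(W, K))) for K in (1, 2, 3)]
```

Each extra plane strictly lowers the Frobenius error on a nonzero residual, which is the knob a mixed-precision allocator turns when it gives one block a larger order than another.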

The asymmetry: weights are easy, activations are not

After all of that, the weight side is the forgiving one. A systematic sweep finds 4-bit weight-only nearly lossless across LLaDA and Dream (GPTQ $>$ AWQ), with the sharp degradation reserved for low-bit activations [2]. The mechanism is structural [4]: LLaDA carries a single super-outlier channel $j^\star$ whose magnitude dominates the hidden state,

$ h_\ell \approx c\,e_{j^\star} + \tilde h_\ell, \qquad |c| \gg \|\tilde h_\ell\|, $

behaving as a learned constant — prune $j^\star$ and generation collapses into repetitive token loops. Around it the early layers are highly redundant (high inter-layer representation similarity), the reverse of the AR pattern where deep layers degenerate, which the authors attribute to over-training.

That redundancy is slack the weight quantizer can spend: 3-bit GPTQ costs LLaDA $<$2 points on GSM8K but costs Llama-3.1-8B $\approx$65 [4]. It also inverts pruning heuristics — under a 50% average-sparsity budget, allocating more sparsity to early layers beats the reverse by $\approx$8.4 points on LLaDA and loses $\approx$8.4 on Llama [4]. The AR rules of thumb are not merely loose here; their sign flips. The hard variables are not $W$ but the activation scales and the calibration support — exactly the quantities coupled to the denoising schedule.

From fake quantization to real kernels

A fake-quantized score — simulate $x \to Q(x) \to \operatorname{Deq}(Q(x))$, then compute in FP16 — bounds recoverable accuracy but not latency. A real speedup needs packed INT4/FP4 weights, low-bit matmul and fast Hadamard/rotation kernels, and the activation quantizer fused into the loop. The loop is the multiplier: emitting $L$ tokens costs $O(T)$ full forward passes, so a per-pass overhead $\Delta$ scales to $T\Delta$, and MoE variants add per-pass routing across many expert matrices. DLLMQuant's $\sim$1.6×/$\sim$3.2× hold only because the kernels are real and the trajectory survives them [1].
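One concrete difference between fake and real quantization is storage. The sketch below packs signed INT4 values two per byte and unpacks them again; the low-nibble-first layout is an assumption made for illustration, since real kernels choose layouts that match their matmul tiles.

```python
import numpy as np

def pack_int4(q):
    """Pack signed INT4 values in [-8, 7] two per byte (low nibble first)."""
    q = np.asarray(q, dtype=np.int8)
    assert q.size % 2 == 0 and q.min() >= -8 and q.max() <= 7
    nibbles = (q & 0x0F).astype(np.uint8)        # two's-complement low 4 bits
    return nibbles[0::2] | (nibbles[1::2] << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed INT4 values as int8."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed & 0x0F).astype(np.int8)
    out[1::2] = (packed >> 4).astype(np.int8)
    return np.where(out > 7, out - 16, out).astype(np.int8)   # sign-extend

q = np.array([-8, 7, -1, 0, 3, -4], dtype=np.int8)
packed = pack_int4(q)
restored = unpack_int4(packed)
```

The round trip is exact and halves the byte count relative to int8 storage. A fake-quantized pipeline never performs this step, which is one reason its accuracy numbers say nothing about memory or latency.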

Summary

The AR toolbox is the right substrate — GPTQ, AWQ, SmoothQuant, QuaRot — and on weights it carries diffusion models further than AR ones [2, 4]. What changes is the quantized object. It is not the per-layer map $W \to Q(W)$ but the closed loop

$ x_{t-1} = g\big(x_t,\, Q(f_\theta)(x_t)\big), \qquad x_T \xrightarrow{Q(f_\theta)} \cdots \xrightarrow{Q(f_\theta)} x_0, $

whose objective is the end-to-end $\min_Q \mathbb E\big[\,d(x_0^{Q}, x_0)\,\big]$, not a single Frobenius residual. The working methods each reduce one piece of that loop to a tractable surrogate: TMAS/MCS calibrate over $\mathbb E_t\,p_t$, IA-AQ weights value error by $A$, CGQ weights $H$ by mask-certainty, ABMP allocates bits by step-sensitivity [1, 3]. Same modules, different inference distribution: a quantizer blind to the schedule minimizes the wrong objective with full confidence.

References

[1] Chen Xu and Dawei Yang, DLLMQuant: Quantizing Diffusion-based Large Language Models, 2025.

[2] Haokun Lin et al., Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs, 2025. Official code: MessiX77/QDLM.

[3] Tianao Zhang et al., Quant-dLLM: Post-Training Extreme Low-Bit Quantization for Diffusion Large Language Models, 2025. Official code: ZTA2785/Quant-dLLM.

[4] Alexander Conzelmann, Albert Catalan-Tatjer, and Shiwei Liu, Layer Collapse in Diffusion Language Models, 2026. Official code: Conzel/super-outlier-dlm.

[5] Shen Nie, Fengqi Zhu, Zebin You, et al., Large Language Diffusion Models, 2025.

[6] Jiacheng Ye, Zhihui Xie, Lin Zheng, et al., Dream 7B: Diffusion Large Language Models, 2025.

[7] Tiwei Bie, Maosong Cao, Kun Chen, et al., LLaDA2.0: Scaling Up Diffusion Language Models to 100B, 2025.

[8] Elias Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2022.

[9] Ji Lin et al., AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, 2023.

[10] Guangxuan Xiao et al., SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2022.

[11] Salaheddin Ashkboos et al., QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs, 2024.

[12] Haokun Lin et al., DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs, 2024.