Apple Foundation Model 3, what even is it?

June 9, 2026

Rishi P

On June 8, 2026, in the middle of a WWDC keynote, Apple published *Introducing the Third Generation of Apple's Foundation Models*. Buried in it is an architecturally interesting on-device model: a 20-billion-parameter language model that runs on a phone thanks to a few clever architectural changes.

This post walks through what AFM 3 actually is, how its flagship on-device model differs from a classic Mixture-of-Experts, why those differences are exactly the right ones for a phone, and where the per-prompt routing approach (discussed below) is likely to crack.

The lineup: five models, two homes

AFM 3 is not one model. It's a family of five, split across two execution environments:

On-device:

AFM 3 Core — a 3B-parameter *dense* model, successor to the ~3B model Apple has shipped since 2024. Lightweight text tasks; what third-party apps get via the Foundation Models framework.

AFM 3 Core Advanced — A 20B-parameter *sparse*, natively multimodal model (new expressive TTS voices, higher-accuracy dictation) that activates only 1 to 4 billion parameters per request. This is gated to Apple's most capable silicon.

On Private Cloud Compute:

AFM 3 Cloud — the server-side text/image-understanding workhorse.

ADM 3 Cloud — image generation and editing (Image Playground, Genmoji, photo tools).

AFM 3 Cloud Pro — the most capable model, for agentic tool use and complex reasoning. It runs on NVIDIA GPUs inside Google Cloud.

This family of models was built "in collaboration with Google". It was pre-trained at scale on cloud TPUs. Gemini is a teacher signal in post-training, not the runtime model.

The problem AFM 3 Core Advanced exists to solve

Every on-device LLM to date has lived under one constraint: the whole model has to fit in DRAM. A phone has maybe 8–12 GB of RAM, most of it reserved by the OS and apps. This is why on-device models have been stuck in the 1–4B range, quantized hard. Phones have 256 GB to 1 TB of NAND flash sitting right there. Why not store a big model in flash and stream it? Because NAND is slow relative to what token-by-token inference demands. Apple published the foundational work on this back in 2024 (*LLM in a Flash*); AFM 3 Core Advanced is that research line shipped.
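To make the constraint concrete, here is a back-of-the-envelope sketch in Python. The 4-bit quantization and the 2 GB/s flash read bandwidth are illustrative assumptions chosen for round numbers, not figures from Apple, and the function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1e9 bytes); ignores KV cache and activations."""
    return params_billion * bits_per_weight / 8


def streamed_tokens_per_sec(active_gb: float, read_gb_per_s: float) -> float:
    """Upper bound on decode speed if every active weight is re-read per token."""
    return read_gb_per_s / active_gb


BITS = 4         # assumed quantization, not from the source
NAND_GB_S = 2.0  # assumed flash read bandwidth, not from the source

full = weights_gb(20, BITS)   # the whole 20B model
small = weights_gb(3, BITS)   # a 3B dense model

print(f"20B at {BITS}-bit: {full:.1f} GB; 3B at {BITS}-bit: {small:.1f} GB")
print(f"Re-reading 20B from flash per token: <= "
      f"{streamed_tokens_per_sec(full, NAND_GB_S):.2f} tokens/s")
```

Under these assumptions the full model needs about 10 GB for weights alone, more than a phone can spare in DRAM, while naively re-reading it from flash for every token caps decoding at a fraction of a token per second.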

How it works, and how that's different from a MoE

A classic Mixture-of-Experts (Mixtral, DeepSeek-V3) works like this: each layer has N expert FFNs, a learned router scores every token at every layer, and the top-k experts fire. Because *any* expert can be needed for the *next* token, all experts must sit in fast memory. MoE sparsity saves FLOPs but does nothing for memory footprint: a 20B MoE still needs ~20B parameters resident. That is fine on an 8×H100 node and impossible on an iPhone.
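A minimal NumPy sketch of per-token top-k routing shows why residency is the problem. This is a toy single layer with ReLU experts and softmax gating over the selected experts; the shapes, names, and random weights are assumptions for illustration, not any particular model's implementation.

```python
import numpy as np


def moe_layer(x, router_w, experts, k=2):
    """x: (tokens, d). router_w: (d, n_experts). experts: list of (w_up, w_down)."""
    logits = x @ router_w                        # router scores every token
    topk = np.argsort(logits, axis=-1)[:, -k:]   # k best experts for each token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        gate = np.exp(logits[t, sel] - logits[t, sel].max())
        gate /= gate.sum()                       # softmax over the selected experts
        for g, e in zip(gate, sel):
            w_up, w_down = experts[e]            # any expert may be needed here
            out[t] += g * (np.maximum(x[t] @ w_up, 0.0) @ w_down)
    return out, topk


rng = np.random.default_rng(0)
d, hidden, n_experts = 8, 16, 4
experts = [(rng.standard_normal((d, hidden)), rng.standard_normal((hidden, d)))
           for _ in range(n_experts)]
router_w = rng.standard_normal((d, n_experts))
x = rng.standard_normal((6, d))

out, topk = moe_layer(x, router_w, experts, k=2)
print("experts used per token:", topk.tolist())
print("distinct experts touched:", sorted(set(topk.ravel().tolist())))
```

Each token does only k experts' worth of compute, but the set of experts touched across a sequence is not known in advance, so the `experts` list has to stay fully resident.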

AFM 3 Core Advanced solves for this.

1. The full 20B parameters live in NAND flash.

2. A small, always-resident dense block reads the prompt during initial processing (prefill) and selects a fixed set of experts *for that request*.

3. A high fraction of capacity is "shared experts" that are always active and always in DRAM; the input-dependent "routed experts" are paged from NAND into DRAM only when selected.

4. The selected routed experts are *patched together with the shared weights to form a dense model in DRAM* and then generation runs through what is effectively an ordinary dense network.

5. The model periodically reselects experts during generation, not every token.
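The steps above can be sketched as a toy NumPy pipeline. Everything here is a stand-in: the `flash` dict plays the role of NAND, the selector is a single random matrix over a mean-pooled prompt, and experts are ungated ReLU FFN slices concatenated along the hidden dimension. None of these details are confirmed by Apple's write-up, and periodic reselection (step 5) is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H_SHARED, H_EXPERT, N_ROUTED = 16, 32, 8, 12

# Stand-in for NAND: routed expert weights, not "resident" until selected.
flash = {e: (rng.standard_normal((D, H_EXPERT)), rng.standard_normal((H_EXPERT, D)))
         for e in range(N_ROUTED)}

# Always-resident pieces: shared FFN weights and the small selector.
shared_up = rng.standard_normal((D, H_SHARED))
shared_down = rng.standard_normal((H_SHARED, D))
selector_w = rng.standard_normal((D, N_ROUTED))  # hypothetical selector


def select_experts(prompt_emb, m):
    """Score routed experts once from the mean-pooled prompt; keep the top m."""
    scores = prompt_emb.mean(axis=0) @ selector_w
    return sorted(np.argsort(scores)[-m:].tolist())


def patch_dense(expert_ids):
    """'Page in' the chosen experts and concatenate them with the shared weights."""
    ups = [shared_up] + [flash[e][0] for e in expert_ids]
    downs = [shared_down] + [flash[e][1] for e in expert_ids]
    return np.concatenate(ups, axis=1), np.concatenate(downs, axis=0)


def dense_ffn(x, w_up, w_down):
    return np.maximum(x @ w_up, 0.0) @ w_down


prompt = rng.standard_normal((5, D))       # prefill: the prompt is read once
chosen = select_experts(prompt, m=3)       # one routing decision per request
w_up, w_down = patch_dense(chosen)         # dense weights now sit in "DRAM"

for _ in range(4):                         # decode loop: no router, no paging
    y = dense_ffn(rng.standard_normal((1, D)), w_up, w_down)

print("selected experts:", chosen, "| patched hidden width:", w_up.shape[1])
```

Because ReLU acts elementwise, the concatenated network computes the same output as running the shared block and each chosen expert separately and summing, which is why the decode loop can treat it as an ordinary dense FFN.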

The underlying technique is Instruction-Following Pruning (IFP), from Apple Research (Jan 2025). Structured pruning normally produces one fixed subnetwork: you remove rows/columns of the FFN weight matrices and every input thereafter runs through the same reduced model. IFP makes the sparsity mask input-dependent. A small predictor network takes the instruction and outputs per-layer masks over the FFN's intermediate dimension, effectively selecting which rows of the up-projection and corresponding columns of the down-projection stay active for this request. Predictor and base model are trained jointly, so the FFN neurons organize into instruction-conditioned groups the predictor can reliably address. At inference, the mask is computed once from the prompt, the selected weights are gathered into a compact dense subnetwork, and the entire generation runs through it at full density; no per-token routing, no gather/scatter in the decode loop.

The result in the paper: with 3B activated parameters, the IFP model beat a 3B dense baseline by 5–8 absolute points on math and coding and matched a 9B dense model. In other words, conditioning the active parameters on the task recovers most of the quality of a model ~3× the active size.
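The masking-then-gathering step can be illustrated with a single toy FFN in NumPy. This is a sketch under simplifying assumptions: a ReLU FFN, an untrained one-matrix `predictor`, and a hard top-k mask. The paper's actual FFN variant, predictor architecture, and training procedure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, keep = 8, 32, 12                    # model dim, FFN width, neurons kept

W_up = rng.standard_normal((h, d))        # rows index intermediate neurons
W_down = rng.standard_normal((d, h))      # columns index the same neurons
predictor = rng.standard_normal((h, d))   # hypothetical mask predictor


def predict_mask(instruction_emb, keep):
    """Score every intermediate neuron from the instruction; keep the top ones."""
    scores = predictor @ instruction_emb
    mask = np.zeros(h, dtype=bool)
    mask[np.argsort(scores)[-keep:]] = True
    return mask


def ffn(x, W_up, W_down):
    return W_down @ np.maximum(W_up @ x, 0.0)


instruction = rng.standard_normal(d)
mask = predict_mask(instruction, keep)    # computed once per request

# Gather the surviving rows/columns into a smaller dense FFN.
W_up_small, W_down_small = W_up[mask], W_down[:, mask]

x = rng.standard_normal(d)
masked_full = W_down @ (mask * np.maximum(W_up @ x, 0.0))  # full FFN, masked
compact = ffn(x, W_up_small, W_down_small)                 # gathered dense FFN

print("compact shapes:", W_up_small.shape, W_down_small.shape)
print("outputs match:", np.allclose(masked_full, compact))
```

The masked full-width FFN and the gathered compact FFN give the same output, so after the one-time gather the decode loop only ever touches the smaller matrices.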

So the differences from a classic MoE, compactly:

| | Classic MoE | AFM 3 Core Advanced (IFP-style) |
| ------------------- | -------------------------- | ------------------------------------------- |
| Routing granularity | Per token, per layer | Per prompt (periodic reselection) |
| What sparsity buys | FLOPs | DRAM residency |
| Expert location | All in fast memory | Full model in NAND; selected subset in DRAM |
| Inference kernel | Sparse/gathered, irregular | Effectively dense after patching |
| Expert granularity | Discrete expert FFNs | Rows/columns of FFN matrices |
| Active size | Fixed top-k | Elastic: 1–4B chosen per use case |

Why these specific changes matter on a phone

Paging gigabytes from flash is expensive, but you pay it once per request instead of per token. The slow memory tier gets hit at human-perceptible timescales (a request) rather than machine timescales (a token). This is the single load-bearing idea of the whole architecture.
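A rough amortization calculation makes the point. The numbers below (1 GB paged, 2 GB/s flash reads, a 200-token response) are placeholders chosen for easy arithmetic, not measurements, and the helper names are hypothetical.

```python
def paging_seconds(paged_gb: float, nand_gb_per_s: float) -> float:
    """Time to read the selected weights from flash once."""
    return paged_gb / nand_gb_per_s


def overhead_ms_per_token(paged_gb: float, nand_gb_per_s: float,
                          tokens_per_page_in: int) -> float:
    """Paging cost spread over the tokens generated between page-ins."""
    return 1000.0 * paging_seconds(paged_gb, nand_gb_per_s) / tokens_per_page_in


PAGED_GB = 1.0    # assumed size of the routed experts paged per selection
NAND_GB_S = 2.0   # assumed flash read bandwidth

per_request = overhead_ms_per_token(PAGED_GB, NAND_GB_S, tokens_per_page_in=200)
per_token = overhead_ms_per_token(PAGED_GB, NAND_GB_S, tokens_per_page_in=1)

print(f"one page-in per 200-token request: {per_request:.1f} ms/token")
print(f"one page-in per token:             {per_token:.1f} ms/token")
```

With these placeholder values, the same half-second read costs 2.5 ms per token when paid once per request and 500 ms per token when paid on every token, which is the gap per-prompt routing is designed to exploit.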

The result: an iPhone gets ~20B parameters of capacity at a ~1–4B DRAM-and-compute cost. Per Apple's human evals, AFM 3 Core Advanced at its *1B* activation beats the previous production systems convincingly.

Where per-prompt routing might crack

The following are our hypotheses, clearly labeled as such, derived from how the mechanism works.

1. The routing decision is a bet made before the work starts. A per-token MoE re-decides specialization thousands of times per generation; AFM commits to one expert committee at prefill. For launch workloads (dictation, TTS, summarization) the task is fully legible from the prompt, so the bet is safe. But heterogeneous requests (code + prose + a math sub-question; agentic loops where tool results inject new domains mid-trajectory) force one committee to cover everything.

2. Periodic reselection trades a quality problem for a latency problem. If the model drifts mid-generation (long chain, topic shift), reselection fixes routing by paging new experts from NAND *in the middle of streaming output*. Long generations may show visible cadence hitches at reselection boundaries, or Apple suppresses reselection frequency and eats the quality drift instead. Either way, the tradeoff doesn't vanish.

3. The selector is a small model making a high-stakes decision, with no per-token escape hatch. Out-of-distribution prompts, such as heavy code-switching (Hinglish dictation is one example), niche jargon, or adversarially phrased requests, can misroute the entire generation onto the wrong subnetwork. In a per-token MoE, a bad routing decision costs one token; here it poisons the whole response. The selector's robustness *is* the model's robustness, and a lightweight dense block is, by construction, the least capable component in the system.

The bottom line

AFM 3 Core Advanced is the first production deployment of a real idea: that on consumer hardware, the scarce resource isn't FLOPs, it's DRAM residency, and sparsity should be spent on *that*. Moving the routing decision from per-token to per-prompt is what makes NAND usable as a weights tier, and patching the selection into a dense model is what makes Apple Silicon happy. Those are the right calls for a phone in 2026.

But the architecture's strengths and weaknesses are the same fact viewed from two sides. It commits early and runs dense, which is exactly why it should struggle when requests are heterogeneous, generations are long, contexts are mixed, or the selector meets inputs it wasn't trained to route. Apple launched it on workloads (dictation, TTS, short assistant turns) where none of those conditions hold.

---

Sources

Apple ML Research — Introducing the Third Generation of Apple's Foundation Models (June 8, 2026 — architecture, eval numbers)

Instruction-Following Pruning for Large Language Models (arXiv 2501.02086, Jan 2025)

LLM in a Flash: Efficient LLM Inference with Limited Memory (arXiv 2312.11514)