From Single-GPU to Distributed Training: A Framework for Making the Right Call

April 20, 2026

A series on distributed training for finetuning LLMs. This post (the overview) is the framework; the four deep dives that follow take each strategy apart on the same hardware with the same model, so the numbers compare directly.

1. Distributed Data Parallel: How It Actually Works — Scale throughput when the model fits on one GPU. 2. ZeRO and FSDP: Model Sharding — Fit models that don't fit, by sharding weights, gradients, and optimizer state. 3. Tensor Parallelism and Sequence Parallelism — Shard inside each layer; wins when interconnect is slow or a single layer is oversized. 4. Pipeline Parallelism: How It Actually Works — Shard across layers; the axis that cheaply spans nodes.

---

The models driving the most capable AI systems today are large, and getting larger. Llama 3.1 ships variants up to 405B parameters. DeepSeek-V3 sits at 671B. Qwen3 goes up to 235B. The pattern is consistent: more parameters means better reasoning, stronger generalization, and the ability to handle tasks that smaller models simply can't. The tradeoff is that training and finetuning these large language models is expensive.

Single-GPU training has a clean mental model: one device owns everything, and the training loop proceeds without coordination. For small models, that holds end to end. The problems start when models scale past what a single GPU can hold.

Finetuning a 13B model on a standard instruction dataset (50,000 samples, 3 epochs, 2048-token sequences) can take a day or more on a single A100. At 70B, a single GPU isn't even an option; full Adam finetuning requires roughly 840 GB. The bottleneck is memory and compute, and a single GPU has hard limits on both. Parameter-efficient methods like LoRA and QLoRA reduce these costs significantly and are often the right first response. But they have limits. They are less effective when a task requires deep changes to the model's representations, and they're no help when throughput across large datasets is the constraint. When those limits are hit, distributed training is the next step. The right approach depends on which constraint is actually binding. Choosing wrong leads to expensive infrastructure that doesn't move the needle.

This post establishes the framework. The rest of the series builds on it.

All four deep dives use the same concrete setup (finetuning Qwen3-4B on AMI meeting transcripts on 2x RTX 4090s over PCIe) so the numbers compare directly. Each deep dive is self-contained — read in order for the full picture, or jump to the one that matches the problem in front of you.

---

What Consumes GPU Memory During Training

The out-of-memory error (OOM) is not mysterious. GPU memory has a fixed capacity, and training requires more than it can hold. It looks like this:

` torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.00 GiB. GPU 0 has a total capacity of 23.69 GiB of which 2.14 GiB is free. Including non-PyTorch memory, this process has 21.23 GiB memory in use. `

The instinct is to reduce batch size. That helps with activations, but it does nothing for the model itself. If weights, gradients, and optimizer states already exceed GPU capacity, the error will appear at batch size 1. Let's look at exactly what fills it. The example above is from finetuning a small 1.5B model; the numbers get significantly worse as model size grows.

When finetuning a model, four components compete for memory:

Parameters (weights)

The model's weights live in memory for the duration of training. A 13B parameter model in 32-bit floating point requires 52 GB (13B × 4 bytes). In 16-bit (BF16 or FP16), that drops to 26 GB. Training nowadays almost always uses BF16 for weights and computation.

Gradients

Backpropagation computes a gradient for every trainable parameter. Gradients have the same shape as the weights, so at BF16, a 13B model's gradients add another 26 GB. Weights and gradients together already total 52 GB before anything else is loaded.

Optimizer States

This is the component most consistently underestimated. Adam, the default optimizer for most finetuning work, maintains two additional tensors per parameter: a running mean of gradients (m) and a running mean of squared gradients (v). Both are typically kept in FP32 for numerical stability, even when the model weights are in BF16. That comes to 2 × 4 bytes × 13B parameters = 104 GB, from the optimizer alone.

Adding it all up for a 13B model:

| Component | Memory | |---|---| | Parameters (BF16) | 26 GB | | Gradients (BF16) | 26 GB | | Optimizer states (FP32) | 104 GB | | Total | ~156 GB |

An H100 has 80 GB (up to 94 GB in the NVL variant). A B200 goes up to 192 GB. An A100 has 80 GB (40 GB in the cheaper variant). A consumer RTX 4090 has 24 GB. On most GPUs available today, including every H100 and H200 variant, the model doesn't fit before a single training sample is loaded. High-memory cards like the NVIDIA B200 (192 GB) and AMD MI300X (192 GB) can hold it, but only 36 GB remain for activations, and that headroom disappears quickly as sequence length or batch size increases.

Activations

Activations are the intermediate values computed during the forward pass and retained for use during backpropagation. In a transformer, they scale with sequence length (quadratically, due to attention) and with batch size. This is the value that grows fastest when batch size or sequence length increases, and the most common cause of out-of-memory errors once the model itself fits. The error above is from a 24 GB consumer GPU finetuning Qwen2.5-1.5B: ~18 GB consumed by weights, gradients, and optimizer states, with the remaining 3 GB taken by already-loaded activations, leaving no room to allocate the next batch.

---

Two Problems, Different Solutions

The central distinction in distributed training is this: running out of memory and training too slowly are different problems. They require different solutions, and the solution to one does not resolve the other.

The throughput problem: the model fits, but training takes too long.

If the model fits in memory but a full training run would take an impractical amount of time, the bottleneck is compute throughput, not memory. The solution is to run multiple copies of the model in parallel, each processing a different slice of the data, and synchronize gradients at the end of each step. Data Parallelism is the standard approach for this; PyTorch's implementation of it is called Distributed Data Parallel (DDP).

It's worth being explicit here: DDP does nothing for memory. Every GPU still holds a full copy of the model. If the model didn't fit on one GPU, adding more GPUs under DDP won't help. Where it helps is throughput: the same training run completes in a fraction of the time.

The memory problem: the model doesn't fit.

If weights, gradients, and optimizer states exceed GPU memory before a single training sample is loaded, reducing batch size provides no relief. The model is out of memory at batch size 1; the problem is independent of data volume. This is the situation with a 13B model on a 24 GB GPU, or a 70B model on anything short of a large cluster.

The solution is to distribute the model itself across multiple GPUs, sharding weights, gradients, and optimizer states so no single device holds the complete set. FSDP, tensor parallelism, and the ZeRO family (ZeRO-1/2/3) all address this class of problem, each with different tradeoffs in communication overhead and implementation complexity. FSDP is PyTorch's native implementation of the ZeRO family, with its FULL_SHARD mode corresponding to ZeRO-3.

This distinction (memory constraint versus throughput constraint) is the first decision point in distributed training. Getting it wrong is expensive.

---

Distributed Training Strategies

Each strategy addresses a specific combination of memory and throughput constraints. DDP and FSDP are the ones most finetuning workflows will encounter; tensor parallelism, pipeline parallelism, and sequence parallelism are more specialized but worth understanding, and they occasionally become the right choice for finetuning too (TP+SP wins on commodity hardware without NVLink, for example).

Data Parallelism (DDP)

*Use when: the model fits on one GPU and training throughput is the bottleneck.*

Each GPU holds a complete copy of the model. Data is partitioned across GPUs, and each performs a full forward and backward pass on its portion. Gradients are averaged across all GPUs via AllReduce, keeping all copies synchronized. The optimizer step then runs independently on each GPU.

Throughput scales with the number of GPUs (near-linearly in well-optimized setups, with reported efficiencies of 0.9-0.95x per GPU). Memory per GPU remains unchanged, equivalent to single-GPU training. For finetuning scenarios where the model fits in memory, DDP is the right starting point and handles the majority of practical use cases.

All ranks must be homogeneous: same GPU model, same memory. AllReduce assumes equal contribution from every rank, and the slowest GPU determines the pace of every step. Mixed hardware (e.g. A100 and H100 in the same job) technically works but causes throughput degradation as faster GPUs idle waiting for slower ones. This holds for FSDP and tensor parallelism as well; distributed training in general assumes a homogeneous cluster.

Note: DDP uses Ring-AllReduce to synchronize gradients. GPUs are arranged in a logical ring; each passes partial gradients to its neighbor until every rank has the full average.

Fully Sharded Data Parallelism (FSDP)

*Use when: the model barely fits or doesn't fit, but throughput still matters.*

FSDP shards weights, gradients, and optimizer states across all participating GPUs. No single GPU holds the complete model. During the forward pass, each GPU gathers the weights it needs, computes its portion, then releases them. During the backward pass, gradients are computed locally, then reduced and re-sharded.

Memory per GPU drops proportionally with the number of GPUs. A 13B model that requires 156 GB on one device ideally requires roughly 39 GB per device across four.

Note: Real usage runs higher than this estimate. Activations are not sharded and each GPU holds its own copy. FSDP also allocates temporary communication buffers during all-gather and reduce-scatter that can spike usage above the steady-state number. Parameter counts rarely divide evenly across shards, adding small alignment overhead on top.

The memory savings come at a cost: FSDP requires significantly more inter-GPU communication than DDP. Where DDP communicates once per step (gradient AllReduce), FSDP communicates at every layer boundary in both the forward and backward pass. This makes interconnect bandwidth a critical factor; FSDP on NVLink performs very differently from FSDP across PCIe or across nodes. For finetuning at the 13B, 70B, or larger scale where memory is the binding constraint, it remains the right tool, but the communication overhead is the reason DDP is preferred when the model fits.

Tensor Parallelism (TP)

*Use when: individual layers are too large for a single GPU, or the interconnect is too slow for FSDP's weight traffic.*

Tensor parallelism splits individual weight matrices across GPUs. A single attention layer or MLP is divided across devices, with each computing a portion of the output. The results are then combined through AllReduces inside each layer.

Because synchronization happens inside each layer on every step, TP is sensitive to interconnect latency. It is designed for high-bandwidth intra-node links (NVLink), but on commodity hardware *without* NVLink it can actually beat FSDP: TP moves only activations (~1–2 GB per step) while FSDP moves the full model's weights three times (~3x model size per step). On a pair of PCIe-connected consumer GPUs, that makes TP+SP substantially faster than FSDP for the same model. See → TP+SP for the measured comparison.

Sequence Parallelism (SP)

*Use when: sequence length is the memory bottleneck, not the model weights.*

Standard TP leaves activations outside the attention and MLP blocks fully replicated on every GPU. At long context lengths (32k, 128k tokens), those replicated activations become the dominant memory consumer even when weights are already sharded. Sequence parallelism shards exactly those activations along the sequence dimension, so each device processes a 1/N slice. The AllGather/ReduceScatter calls it introduces replace TP's AllReduces one-for-one, with the same total bytes moved — you get the activation savings for free.

SP is almost always combined with TP, sharing the same process group and communication primitives. It is relevant for long-context finetuning, instruction tuning on long documents, and any setting where sequence length rather than model size is driving OOM errors.

Pipeline Parallelism (PP)

*Use when: the model must span multiple nodes and the inter-node interconnect is too slow for tensor parallelism.*

Pipeline parallelism assigns different model layers to different GPUs. GPU 0 handles layers 0–11, GPU 1 handles layers 12–23, and so on. Data moves through the pipeline sequentially, with each GPU passing its outputs to the next stage.

The sequential flow introduces overhead: early-stage GPUs sit idle waiting for later stages to finish before the next batch can enter, a gap called a pipeline bubble. Microbatching and schedules like 1F1B shrink the bubble, but it never disappears entirely.

The reason to accept this tradeoff is interconnect. TP synchronizes inside every layer on every step, which requires NVLink-level bandwidth within a node. PP only communicates at layer boundaries (one hidden-state tensor per microbatch) and is tolerant of slower inter-node links. For standard finetuning at 4B-70B parameters this is almost never the right tool; its natural home is 100B+ models spanning multiple nodes.

---

A Decision Framework

The right strategy depends on which constraint is actually binding. Work through the following questions in order.

0. Can LoRA or QLoRA solve the problem?

Before reaching for distributed infrastructure, try parameter-efficient finetuning. QLoRA can fit a 13B model that requires 156 GB for full Adam finetuning on a single 24 GB GPU. For many downstream tasks the results are comparable to full finetuning. Only move past this step if the task requires deep representation changes, is a pretraining run, or if throughput across a large dataset is the binding constraint.

1. Does the model fit on one GPU for full finetuning?

If yes, and throughput is the bottleneck, use DDP. Each GPU holds a full model copy, data is split across GPUs, and gradients are averaged via AllReduce. Throughput scales near-linearly with GPU count. This is the lowest-complexity distributed option and the right starting point.

If activations fill memory before a useful batch size is reachable, try gradient accumulation first. It simulates larger batches by accumulating gradients over multiple forward passes before stepping the optimizer, with no additional hardware required. DDP can be layered on top if throughput still falls short.

2. Do optimizer states cause the OOM, but the model weights fit?

If weights and gradients fit but Adam's optimizer states (which are 4x the size of weights in FP32) push usage over the limit, the ZeRO family offers graduated relief without the full communication cost of FSDP:

  • ZeRO-1: shards only optimizer states across GPUs. Least communication overhead. Use when optimizer states alone are the bottleneck.
  • ZeRO-2: shards optimizer states and gradients. More memory savings, slightly more communication.
  • ZeRO-3 / FSDP: shards everything including weights. Maximum memory reduction, highest communication cost. Use when the model itself doesn't fit.
  • In PyTorch, FSDP exposes ZeRO-2 (SHARDGRADOP) and ZeRO-3 (FULLSHARD) directly, plus a NOSHARD mode equivalent to DDP. It has no native ZeRO-1 equivalent — for optimizer-state-only sharding, reach for DeepSpeed ZeRO-1, which (with Hugging Face Trainer) is configured explicitly.

    3. Does the model require multiple GPUs to fit at all?

    Use FSDP (or DeepSpeed ZeRO-3). Weights, gradients, and optimizer states are sharded across all GPUs; no single device holds the complete model. The number of GPUs required depends on model size and available device memory. FSDP assumes a fast interconnect (NVLink, or fast inter-node links in hybrid-shard configurations); on systems without NVLink, the per-layer AllGather/ReduceScatter traffic can dominate step time — in which case tensor parallelism + sequence parallelism is often the better choice, because it moves only activations across the interconnect rather than full weights.

    At 70B+ parameters, FSDP across a single node is often insufficient. At this scale, FSDP combined with tensor parallelism becomes necessary: FSDP handles sharding across node boundaries while TP splits individual layers within a node over NVLink.

    4. Is sequence length the bottleneck?

    For long-context finetuning (32k, 128k token sequences), activations scale quadratically with sequence length and become the dominant memory consumer even when the model itself fits. Sequence parallelism splits activation tensors along the sequence dimension across GPUs. It is layered on top of tensor parallelism since both operate at the intra-layer level and share the same communication operators.

    The large majority of finetuning workflows fall into steps 0 through 3: QLoRA or LoRA first, then DDP if throughput is the issue, then FSDP (or TP+SP on slower interconnects) if memory is.

    ---

    Finetuning is fundamentally a resource allocation problem. The model has a fixed memory footprint, the dataset has a fixed size, and the hardware has fixed limits. Distributed training is not a single solution to that problem; it is a family of techniques that each address a specific part of it, and picking the wrong one wastes infrastructure without moving the needle.

    The practical path for most finetuning work is narrower than the full landscape suggests. LoRA or QLoRA resolves the majority of memory constraints without any additional hardware. When full finetuning is required and the model fits, DDP turns one GPU's worth of compute into many. When the model doesn't fit at all, FSDP distributes the memory burden across devices — or TP+SP does, if the interconnect is too slow for FSDP's weight traffic. PP only enters the picture when the model must span multiple nodes, which is rare for finetuning.

    These techniques compose. QLoRA + DDP is a common and practical pattern: QLoRA's 4-bit quantization reduces the model's memory footprint enough to fit on each GPU, while DDP runs multiple copies in parallel and synchronizes gradients across them. Memory savings from QLoRA, throughput scaling from DDP, without the complexity of model sharding.

    Understanding the full landscape matters not because all of it will be used, but because knowing where each technique applies makes it possible to diagnose the actual constraint quickly and reach for the right tool without guessing. The four deep dives that follow this post (→ DDP, → FSDP, → TP+SP, → PP) take each tool apart.