LLM Inference at the Edge

March 30, 2026

LLM Inference at the Edge: What Actually Happens When the Device Gets Hot

---

Most LLM benchmarks report a single number: peak throughput on a fresh device. That number is useful for marketing. It is not useful for deployment.

Real edge inference runs back-to-back requests. A phone assistant doesn't cool down between queries. An RPi-based home server doesn't get a five-minute break after every conversation. When these devices run 20 consecutive inference requests, the performance curve tells a very different story from the peak number. On some platforms it barely moves. On others it falls off a cliff.

This post covers what we found when we benchmarked Qwen 2.5 1.5B (4-bit) across four platforms: an RTX 4050 laptop, a Raspberry Pi 5 with a Hailo-10H NPU, an iPhone 16 Pro, and a Samsung Galaxy S24 Ultra. The focus throughout is sustained performance, not peak. What happens at iteration 15 of 20, when the device is hot and the scheduler is starting to fight you.

Team: Pranay Tummalapalli, Sahil Arayakandy, Ritam Pal, Kautuk Kundan (Conscious Engines)

---

Why These Choices

The Model: Qwen 2.5 1.5B Instruct (4-bit)

A 1.5B model at 4-bit quantization fits in under 2 GB of device memory, which makes it deployable across all four targets without any platform-specific size compromises. It is also large enough to actually exercise real inference pipelines, not toy forward passes. Using the same model across all platforms eliminates quality as a variable. We are benchmarking the hardware and runtime stack, not the model.

The quantization format differs per platform: GPTQ on the GPU, MLC binary on Android, native MLX weights on iOS, Hailo-compiled HEF on the NPU. The model is the same; the packaging is native to each target.

The Prompt Length: 250 to 500 Tokens

Long enough to stress the prefill stage and produce a meaningful decode window. Short enough to complete a run in a few seconds, which matters because we need 20 iterations before the device overheats.

The Iteration Count: 20 Back-to-Back

20 iterations is where thermal throttling becomes visible. A single-shot benchmark is a cold-start measurement. After 5 iterations, mobile devices are warm. After 10, some are hot. After 20, the sustained behavior is clear: either the device holds its performance or it doesn't.

The 1-second gap between iterations is deliberate. It lets the OS scheduler breathe without masking thermal build-up. Any shorter and we risk measuring scheduler contention rather than thermal throttling. Any longer and the device starts cooling between runs.
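The loop described above can be sketched as follows. This is a minimal illustration, not the project's harness: `run_benchmark` and the `generate` callable (which runs one request and returns the number of decoded tokens) are hypothetical names.

```python
import time
from typing import Callable, List


def run_benchmark(
    generate: Callable[[str], int],
    prompt: str,
    iterations: int = 20,
    gap_s: float = 1.0,
) -> List[float]:
    """Run back-to-back iterations and return tokens/second for each one.

    `generate` is a hypothetical callable that runs one inference request
    and returns the number of tokens it decoded.
    """
    throughputs = []
    for i in range(iterations):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        throughputs.append(n_tokens / elapsed)
        if i < iterations - 1:
            time.sleep(gap_s)  # short pause between iterations, not a cool-down
    return throughputs
```

Keeping the per-iteration throughput list, rather than a single average, is what makes the thermal curve visible afterwards.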

---

The Platforms

RTX 4050 Laptop (Linux + PyTorch + vLLM)

The laptop GPU is the control condition: active cooling, a real thermal budget, and a mature inference ecosystem. We built five benchmark scripts, starting from raw PyTorch FP16 decode and ending at vLLM with hardware monitoring.

Inference frameworks are not interchangeable:

| Framework | Throughput | Notes |
|---|---|---|
| PyTorch FP16 (direct) | ~60 tok/s | Baseline, no batching, no kernel fusion |
| BitsAndBytes NF4 | ~70 tok/s | Dynamic quantization overhead on first call |
| GPTQ-Int4 (auto-gptq) | ~90 tok/s | Pre-quantized weights, faster decode |
| vLLM (GPTQ-Int4) | 131.7 tok/s | PagedAttention + continuous batching |

vLLM's PagedAttention eliminates KV-cache fragmentation and enables better GPU utilization. For single-request benchmarking it still wins significantly over plain Transformers because of CUDA kernel optimizations.

Thermal stability is a structural GPU advantage:

The RTX 4050 ran 20 iterations at a coefficient of variation of 2.2%. The laptop's cooling solution is sized for a 70 W TDP. At 34 W for inference, it barely registers. The fan ramps slightly and the system holds. This is structurally different from passively cooled mobile SoCs, where the same heat has nowhere to go.
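For reference, the coefficient of variation over per-iteration throughput can be computed as below. This is a sketch that assumes the sample standard deviation; the post does not say whether the sample or population form was used.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Return the sample standard deviation as a percentage of the mean."""
    mean = statistics.fmean(samples)
    return 100.0 * statistics.stdev(samples) / mean
```

A low value means the 20 iterations cluster tightly around their mean, which is how the post uses CV as a stability measure.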

One instrumentation note worth flagging:

nvidia-smi --loop-ms=100 is too coarse for power spikes: sampling GPU power every 100 ms misses any event shorter than that. A background HardwareMonitor running on its own Python thread worked but introduced minor lock contention. The right solution is pynvml, which gives programmatic NVML access at under 10 ms granularity. We would do this differently if running the experiment again.
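A minimal sketch of what the pynvml approach could look like. `PowerSampler` and `make_nvml_power_reader` are illustrative names, not the project's code. The sampler thread is the only writer and results are read after it joins, so no lock is needed. How often the driver actually refreshes the power reading depends on the hardware, so the 10 ms interval is a polling rate, not a guaranteed resolution.

```python
import threading
import time
from typing import Callable, List, Tuple


def make_nvml_power_reader(gpu_index: int = 0) -> Callable[[], float]:
    """Return a function that reads GPU board power in watts via NVML."""
    import pynvml  # imported lazily so the sampler below works without it

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # nvmlDeviceGetPowerUsage reports milliwatts.
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0


class PowerSampler:
    """Poll a power-reading callable on a background thread."""

    def __init__(self, read_power_w: Callable[[], float], interval_s: float = 0.01):
        self._read = read_power_w
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples: List[Tuple[float, float]] = []  # (timestamp, watts)

    def _run(self) -> None:
        while not self._stop.is_set():
            self.samples.append((time.perf_counter(), self._read()))
            self._stop.wait(self._interval)

    def __enter__(self) -> "PowerSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._stop.set()
        self._thread.join()
```

Usage would be `with PowerSampler(make_nvml_power_reader()) as s: ...` around the benchmark loop, then reading `s.samples` once the block exits.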

---

Raspberry Pi 5 + Hailo-10H NPU

The RPi 5 and Hailo-10H pairing is the most architecturally interesting setup in the experiment. They are two physically distinct processors connected via M.2 PCIe, and understanding which stage runs where is essential to interpreting the numbers.

#### Where Each Stage Runs

```
┌─────────────────────────────────────────────────────────────┐
│  Raspberry Pi 5 (BCM2712)                                   │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  ARM Cortex-A76 × 4 (CPU)                            │   │
│  │                                                      │   │
│  │  1. Tokenisation (tiktoken / sentencepiece)          │   │
│  │  2. KV-cache management (host RAM, LPDDR4X)          │   │
│  │  3. Sampling / top-p/k (numpy, single-threaded)      │   │
│  │  4. De-tokenisation (vocab lookup)                   │   │
│  │  5. hailo-ollama server (HTTP, JSON, SSE framing)    │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │  PCIe Gen 2 ×1 (~4 Gbit/s)     │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Hailo-10H NPU (M.2 2242, via PCIe)                  │   │
│  │                                                      │   │
│  │  6. Linear projections (Q, K, V, O, FFN layers)      │   │
│  │  7. Matrix-vector multiply (weight × activation)     │   │
│  │  8. Activation functions (SiLU, RMSNorm fused)       │   │
│  │     compiled into HEF dataflow graph                 │   │
│  │     fixed at compile time, not dynamic               │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Out-of-band: INA219 sensor ──I2C──▶ RPi GPIO ──▶ benchmark script
             (measures Hailo 3.3V rail at 1 kHz, not visible to hailo-rt)
```

The Hailo-10H only handles the matrix-heavy parts of each transformer layer: the expensive O(d²) operations. Everything involving variable-length sequences, dynamic decisions, and the autoregressive loop lives on the CPU. This is unlike a GPU, where the entire forward pass including sampling runs in device memory.

#### What Happens Per Token

```
CPU                                       Hailo-10H NPU
 │                                              │
 │  [hidden state, shape: 1×2048]               │
 │ ──────────── PCIe DMA transfer ─────────────▶│
 │                                              │  Linear: Q projection (2048→2048)
 │                                              │  Linear: K projection (2048→2048)
 │                                              │  Linear: V projection (2048→2048)
 │                                              │  (RoPE embeddings: compiled into HEF)
 │                                              │  Linear: O projection (2048→2048)
 │                                              │  FFN gate (2048→5632, SiLU)
 │                                              │  FFN up (2048→5632)
 │                                              │  FFN down (5632→2048)
 │                                              │  RMSNorm (fused)
 │◀───────────── PCIe DMA transfer ──────────── │
 │  [updated hidden state, shape: 1×2048]       │
 │                                              │
 │  KV-cache append (in LPDDR4X host RAM)       │
 │  Attention score × past KV (on CPU)          │
 │  top-p sampling → next token id              │
 │  de-tokenise → emit to HTTP stream           │
```

Attention over the KV cache runs on the CPU, not the NPU. This is a meaningful constraint: as the sequence grows, attention cost on the CPU increases, while the NPU's per-layer cost stays constant. For 250-token prompts with 200-token decodes, it is manageable. At longer contexts, it would dominate.
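To make the scaling concrete, here is a rough sketch of how much cached K/V data the CPU has to read for each decoded token. The layer count (28) and projection width (2048) are the figures shown in this post's diagrams. FP16 cache entries are an assumption, and `kv_bytes_read_per_token` is an illustrative name.

```python
def kv_bytes_read_per_token(
    seq_len: int,
    n_layers: int = 28,       # layer count stated in this post
    kv_dim: int = 2048,       # K/V projection width from the diagram above
    bytes_per_elem: int = 2,  # assumption: FP16 cache entries
) -> int:
    """Bytes of cached K and V the CPU must read to decode one token."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_elem


for seq_len in (450, 2048, 8192):
    mb = kv_bytes_read_per_token(seq_len) / 1e6
    print(f"context {seq_len:>5}: ~{mb:.0f} MB of KV cache read per token")
```

The read volume grows linearly with context length while the NPU's work per token stays fixed, which is the asymmetry the paragraph above describes.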

#### How the HEF Gets Built

The HEF (Hailo Executable Format) is a fully compiled, static dataflow graph. Unlike GPU inference where kernels are JIT-compiled or cached at runtime, everything is baked in at compile time:

```
ONNX model
    │
    ▼
Hailo Model Zoo compiler
    │
    ├── Operator fusion (e.g. Linear + RMSNorm → single HEF node)
    ├── Weight quantization (INT8 activations, INT8/INT4 weights)
    ├── Dataflow scheduling (static, no dynamic dispatch at runtime)
    ├── Memory layout optimization (for Hailo's on-chip SRAM tiling)
    └── PCIe DMA descriptors (pre-allocated transfer descriptors)
    │
    ▼
qwen2.5-1.5b.hef (~400 MB)
```

This static compilation has real consequences. Sequence length is fixed at compile time, so long-context inference requires recompilation. There is no runtime quantization swap. And the deterministic dataflow schedule is exactly why the Hailo achieves a CV of 0.04%: no kernel dispatch jitter, no stream synchronization overhead, and almost no run-to-run variance. The same graph runs the same way every time.

#### Hailo vs. CPU: Side by Side

```
Time ──────────────────────────────────────────────────────────▶

HAILO PATH (6.9 tok/s, ~145 ms/token):

CPU │◀── PCIe ──▶│◀─── wait ───────────────────────────────▶│◀── PCIe ──▶│ sample │ emit │
NPU │            │◀────────── 28-layer HEF graph ──────────▶│            │        │      │

    ├── ~2 ms ───►├──────────────────── ~130 ms ─────────────►├── ~2 ms ───►├─ ~5 ms►├─1ms─►
      DMA in       NPU computes all linear + FFN ops           DMA out       CPU attn  samp

CPU-ONLY PATH (~4 tok/s, ~250 ms/token):

CPU │◀──────────────────────── all 28 layers (NEON GEMV) ────────────────────────▶│ sample │
    ├──────────────────────────────── ~240 ms ────────────────────────────────────►├─ ~5 ms►
      Linear + attention + FFN × 28, serialized on 4 ARM cores                       samp
```

The NPU path pays a PCIe round-trip overhead on every token (around 4 ms total), but the NPU's parallel dataflow computes all 28 layers so much faster than the CPU can do them serially that total per-token latency is still lower (~145 ms versus ~250 ms).
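As a quick sanity check, the stage times in the timeline above can be summed into a per-token budget and converted to throughput. The dictionary keys are illustrative labels; the numbers are read off the diagram.

```python
# Per-token stage times in milliseconds, read off the timeline above.
HAILO_STAGES_MS = {
    "dma_in": 2.0,
    "npu_compute": 130.0,
    "dma_out": 2.0,
    "cpu_attention": 5.0,
    "sample_and_emit": 1.0,
}
CPU_ONLY_STAGES_MS = {"all_layers": 240.0, "sampling": 5.0}


def tokens_per_second(stages_ms: dict) -> float:
    """Convert a serial per-token latency budget into throughput."""
    return 1000.0 / sum(stages_ms.values())


print(f"Hailo path: {tokens_per_second(HAILO_STAGES_MS):.1f} tok/s")
print(f"CPU only:   {tokens_per_second(CPU_ONLY_STAGES_MS):.1f} tok/s")
```

The itemized stages sum to about 140 ms and 245 ms, slightly under the ~145 ms and ~250 ms headline figures. The gap is presumably time not itemized in the diagram.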

#### The Numbers

| Mode | Throughput | Power | Energy/token |
|---|---|---|---|
| Hailo-10H NPU | 6.9 tok/s | <2 W | ~290 mJ |
| RPi 5 CPU (Ollama) | ~4 tok/s | ~5 W | ~1250 mJ |

The NPU is not dramatically faster in tokens per second. Its advantage is energy efficiency. At under 2 W with CV=0.04%, it is thermally inert, and at that power level we would expect it to run 10,000 iterations without throttling. The CPU bottleneck (attention over the growing KV cache on ARM) is the reason the throughput gain is not larger.
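The energy-per-token column follows directly from power divided by throughput. A small sketch of the arithmetic, with `energy_per_token_mj` as an illustrative name:

```python
def energy_per_token_mj(power_w: float, tokens_per_s: float) -> float:
    """Energy per token in millijoules: watts divided by tokens/second."""
    return 1000.0 * power_w / tokens_per_s


print(f"Hailo-10H: {energy_per_token_mj(2.0, 6.9):.0f} mJ/token")
print(f"RPi 5 CPU: {energy_per_token_mj(5.0, 4.0):.0f} mJ/token")
```

Because the Hailo figure is reported as under 2 W, the ~290 mJ value is best read as an upper bound.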

One measurement detail worth noting:

The Hailo NPU has no OS-visible power interface. There is no equivalent of nvidia-smi or RAPL for it. The only way to get real power numbers is an INA219 current sensor on the physical power rail, sampled at 1 kHz via I2C and correlated with benchmark timestamps offline. This is not a software limitation with a software fix. It is a hardware design choice, and it adds real measurement overhead to any serious evaluation.
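The offline correlation step amounts to integrating the logged power samples over each iteration's time window. A minimal sketch, assuming the sensor log and the benchmark timestamps share one clock, with `energy_in_window_j` as an illustrative name:

```python
from typing import Sequence, Tuple


def energy_in_window_j(
    samples: Sequence[Tuple[float, float]],
    start_s: float,
    end_s: float,
) -> float:
    """Trapezoidal integral of power (W) over samples inside [start_s, end_s].

    `samples` is a time-sorted sequence of (timestamp_s, power_w) pairs,
    for example decoded from an INA219 log.
    """
    inside = [(t, p) for t, p in samples if start_s <= t <= end_s]
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(inside, inside[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy
```

Dividing the result by the number of tokens decoded in that window gives energy per token for the iteration.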

---

iPhone 16 Pro (Swift + MLX)

We built a SwiftUI iOS app with a benchmark runner using Apple's MLX framework for on-device inference. The stack looks like this:

```
SwiftUI ContentView
  → LLMEvaluator (@Observable, @MainActor)
    → LLMModelFactory (mlx-swift-examples)
      → MLX compute graph (mlx-swift)
        → Metal (Apple GPU compute shaders)
          → Apple A18 Pro GPU (6-core, unified memory)
```

MLX uses a lazy evaluation model where operations build a compute graph and .eval() triggers execution. The Metal GPU backend leverages unified memory, where CPU and GPU share the same physical RAM, which eliminates the PCIe copy bottleneck that affects discrete GPU inference.

Thermal throttling is the dominant constraint:

| Iterations | Throughput | Thermal State |
|---|---|---|
| 1-5 | ~40 tok/s | Normal to Warm |
| 6-10 | ~30 tok/s | Warm to Hot |
| 11+ | ~22.6 tok/s | Hot to Very Hot |

That is a 44% drop from the first iterations to the sustained rate after iteration 11. iOS's thermal throttle is aggressive and symmetric: it scales down GPU clock uniformly across all workloads. Unlike Android (covered below), it does not hard-floor at a fixed frequency but reduces dynamically. The result is a gradual but steep curve rather than a sudden cliff.

iOS power measurement is battery-only, and that is not a workaround failure:

iOS does not expose per-component power draw to third-party apps. UIDevice.batteryLevel rounded to 1% is the only energy proxy available. Over 20 iterations (roughly 5 minutes), we saw 3-5% battery drain, which is too coarse for per-token energy estimates. There is no API to unlock finer granularity. This is a fundamental platform design choice, not a missing library.

---

Samsung Galaxy S24 Ultra (Kotlin + MLC-LLM + TVM + Adreno)

We forked the MLC-LLM Android app and built a custom BenchmarkService: a foreground Android service that runs automatically on launch, loads the model, executes 20 benchmark iterations, and logs telemetry to CSV. The stack:

```
BenchmarkService (Kotlin, Foreground Service)
  → MLCEngine.kt (Kotlin JNI wrapper)
    → libmlc_llm.so (C++ TVM runtime)
      → TVM relax compiled model (Q4f16_2, OpenCL)
        → Adreno 750 GPU (via OpenCL)
```

The full build pipeline for this target is the heaviest in the experiment:

1. Clone mlc-llm with TVM submodule (3rdparty/tvm/)
2. Run mlc_llm compile on the Qwen 2.5 1.5B checkpoint to generate a .so and mlc-chat-config.json
3. prepare_libs.py copies the compiled .so into the Android project's JNI libs directory
4. CMake builds the JNI bridge (mlc4j/)
5. Gradle assembles the APK with ABI filter arm64-v8a
6. adb install and adb shell am start

The Adreno throttle is not gradual. It is a binary cliff:

At full frequency (~900 MHz), the S24 Ultra produces around 25 tok/s. After hitting the thermal threshold, Snapdragon 8 Gen 3's Adreno 750 hard-floors at 231 MHz. The result is roughly 5 tok/s. That is not a curve; it is a step function. Only 5 valid iterations were possible before hitting the floor.

This is Samsung's thermal policy, not Qualcomm's. Other OEMs using the same SoC show different curves. Thermal policy is OEM firmware, and it varies.

Android power APIs are unreliable for precision telemetry:

BatteryManager.BATTERY_PROPERTY_CURRENT_NOW returns the wrong sign on many Samsung devices. BATTERY_PROPERTY_CHARGE_COUNTER drifts. We excluded Android power figures from the final analysis. The Adreno GPU frequency readings from sysfs were reliable:

```
/sys/class/kgsl/kgsl-3d0/devfreq/cur_freq      # current GPU clock
/sys/class/kgsl/kgsl-3d0/devfreq/max_freq      # thermal cap (when set)
/sys/class/kgsl/kgsl-3d0/gpu_busy_percentage   # utilization
/sys/class/thermal/thermal_zone*/temp          # zone temperatures
```

These paths are Adreno-specific and not guaranteed across SoCs. The kgsl driver is Qualcomm's proprietary GPU driver exposed through sysfs.
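Since the nodes are not guaranteed to exist, any telemetry reader should tolerate missing files. A minimal sketch, assuming the logged values are parsed on the host after being pulled from the device and that cur_freq reports hertz; `read_sysfs_int`, `is_at_floor`, and `GPU_FLOOR_HZ` are illustrative names.

```python
from pathlib import Path
from typing import Optional

GPU_FLOOR_HZ = 231_000_000  # throttle floor reported in this post (231 MHz)


def read_sysfs_int(path: str) -> Optional[int]:
    """Read an integer sysfs node, returning None if it is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def is_at_floor(cur_freq_hz: Optional[int]) -> bool:
    """True when the GPU clock is at or below the throttled floor."""
    return cur_freq_hz is not None and cur_freq_hz <= GPU_FLOOR_HZ
```

Returning `None` instead of raising lets the same logging loop run on devices where a node is absent, which keeps missing telemetry distinguishable from a real reading.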

Running a 20-iteration benchmark without Android killing the process requires more work than expected:

startForeground() with a persistent notification

WakeLock (PARTIAL_WAKE_LOCK) to prevent CPU sleep

FLAG_KEEP_SCREEN_ON + setShowWhenLocked + setTurnScreenOn to prevent the lock screen

setOomScore(-800) to lower low-memory-killer priority

Without this, Android Doze kills background services after roughly 10 minutes. The benchmark needs all of it to complete reliably.

---

What the Numbers Say

The Sustained vs. Peak Gap

| Platform | Peak (iter 1) | Sustained (iter 20) | Drop |
|---|---|---|---|
| RTX 4050 | ~135 tok/s | ~131 tok/s | 3% |
| Hailo-10H | 6.9 tok/s | 6.9 tok/s | 0% |
| iPhone 16 Pro | 40 tok/s | 22.6 tok/s | 44% |
| Galaxy S24 Ultra | ~25 tok/s | ~5 tok/s | 80% |

The laptop and the NPU hold. The phones do not. Mobile SoCs are designed for burst workloads: photo processing, gaming at 60 fps, brief ML inference for face unlock. They are not designed for continuous high-compute jobs at sustained load, and LLM inference stress-tests a thermal regime they were not built for.

Quoting peak performance for sustained workloads is not just misleading; it inverts the actual deployment story for two of the four platforms here.
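The drop column is simply the relative loss from peak to sustained throughput. A short sketch of the arithmetic using the table's figures, with `drop_percent` as an illustrative name:

```python
# (peak tok/s at iteration 1, sustained tok/s at iteration 20), from the table above.
RESULTS = {
    "RTX 4050": (135.0, 131.0),
    "Hailo-10H": (6.9, 6.9),
    "iPhone 16 Pro": (40.0, 22.6),
    "Galaxy S24 Ultra": (25.0, 5.0),
}


def drop_percent(peak: float, sustained: float) -> float:
    """Relative throughput loss from peak to sustained, in percent."""
    return 100.0 * (peak - sustained) / peak


for platform, (peak, sustained) in RESULTS.items():
    print(f"{platform}: {drop_percent(peak, sustained):.1f}% drop")
```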

Memory Bandwidth is the Real Bottleneck

For single-user edge inference, LLM decode is memory-bandwidth-bound, not compute-bound:

Each token requires reading all model weights (~750 MB for 1.5B 4-bit) from device memory

At 131 tok/s (RTX 4050), that is 131 × 750 MB = ~98 GB/s effective weight read throughput

RTX 4050 memory bandwidth is 192 GB/s, which checks out (KV cache and activations add overhead on top)

This is why 4-bit quantization is the practical deployment format everywhere: it halves the weight read volume per token compared to 8-bit. The benefit is bandwidth, not compute. Models get faster because there is less to read from memory, not because there are fewer multiplications.
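The bandwidth argument above can be checked with a few lines of arithmetic. This sketch ignores quantization scales, the KV cache, and activations, so the bound it prints is optimistic. The function names are illustrative.

```python
def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the quantized weights in bytes (ignores scales/zeros)."""
    return n_params * bits_per_weight / 8


def effective_read_gb_s(tokens_per_s: float, model_bytes: float) -> float:
    """Weight bytes streamed per second if every token reads all weights once."""
    return tokens_per_s * model_bytes / 1e9


def bandwidth_bound_tok_s(mem_bw_gb_s: float, model_bytes: float) -> float:
    """Upper bound on decode throughput if weight reads were the only traffic."""
    return mem_bw_gb_s * 1e9 / model_bytes


w4 = weight_bytes(1.5e9, 4)  # 4-bit weights
w8 = weight_bytes(1.5e9, 8)  # 8-bit weights, twice the read volume
print(f"4-bit weights: {w4 / 1e6:.0f} MB, 8-bit weights: {w8 / 1e6:.0f} MB")
print(f"Effective read rate at 131 tok/s: {effective_read_gb_s(131, w4):.1f} GB/s")
print(f"Bound at 192 GB/s: {bandwidth_bound_tok_s(192, w4):.0f} tok/s (4-bit), "
      f"{bandwidth_bound_tok_s(192, w8):.0f} tok/s (8-bit)")
```

Under these assumptions the measured 131 tok/s sits at roughly half of the weight-read ceiling for 4-bit, and doubling the bytes per weight halves that ceiling.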

Energy Efficiency and Raw Throughput Trade Off by Architecture

The Hailo-10H is the most energy-efficient device in this study at ~290 mJ per token. The RTX 4050 is 17x less efficient at around 5000 mJ per token, but produces 19x the throughput. The iPhone is in the middle for both. The Samsung, at the throttled floor, produces the worst energy and throughput numbers in the study.

The right choice between these depends on what you are optimizing for. Battery-powered, always-on edge devices where power budget matters: the NPU architecture wins by a large margin. Applications where response latency is the constraint and power is less of a concern: the GPU wins.

Build Complexity Scales Dramatically

| Platform | Build complexity | Approximate build time |
|---|---|---|
| RTX 4050 | pip install -r requirements.txt | Under 5 minutes |
| RPi 5 + Hailo | pip install requests + hailo-ollama setup | ~30 minutes |
| iPhone 16 Pro | Xcode SPM | ~20 minutes, first build |
| Galaxy S24 Ultra | Conda + TVM + NDK + CMake + Gradle | 2-4 hours |

The Android and MLC-LLM build took more engineering time than the other three platforms combined. TVM requires a specific Python environment, the NDK version matters for ABI compatibility, and the nested submodule structure (mlc-llm, with TVM inside it) creates dependency management problems that surface at every environment change. This is not a reason to avoid the platform, but it is a real cost that should factor into project planning.

---

The Architecture in One View

```
                  TinyBench
                      │
      ┌───────────────┼───────────────┬───────────────┐
      │               │               │               │
  RTX 4050      RPi 5 + Hailo   iPhone 16 Pro     S24 Ultra
   (Linux)       (Linux ARM)        (iOS)         (Android)
      │               │               │               │
PyTorch/vLLM    hailo-ollama     MLX (Swift)    MLC-LLM (TVM)
      │               │               │               │
CUDA kernels      Hailo HEF     Metal shaders  OpenCL kernels
      │               │               │               │
GDDR6 (6 GB)     M.2 LPDDR5    Unified (8 GB)  LPDDR5X (12 GB)
```

No two platforms share a runtime. This was intentional. A portable-but-slow solution (one Python framework running on all four targets) would have produced unrepresentative results. Native inference on each target is what the actual deployment scenario looks like.

---

What We Would Do Differently

More granular power measurement on Android. The INA219 approach that worked on the RPi should have been applied to the S24 Ultra via USB power delivery monitoring. Android's software APIs are unreliable, and we knew that early enough to plan around it.

NVML instead of nvidia-smi subprocess. The subprocess approach works but has 100 ms granularity and process fork overhead. pynvml gives programmatic access at under 10 ms with no subprocess overhead.

Longer burn-in for iOS. 20 iterations was not enough to see whether throughput stabilizes after the initial thermal drop or continues declining. 50 iterations with a longer inter-iteration gap would have been more informative.

Hailo at higher batch sizes. The Hailo NPU's dataflow architecture may be more competitive at batch size greater than 1, where multiple simultaneous users are being served. Single-user benchmarking may undersell it.

---

Key Takeaways

Thermal management is the binding constraint for sustained mobile inference, not raw compute capacity. The phones both throttle severely. The laptop and the NPU do not. This is the central finding and it is not visible in any peak-throughput benchmark.

Energy efficiency and throughput trade off differently by architecture. The Hailo NPU is the most efficient device here by a wide margin. The RTX 4050 is the fastest but uses 17x more energy per token. Neither is universally better; the right choice depends on the deployment constraint.

Cross-platform LLM benchmarking requires platform-native stacks. vLLM on a laptop, hailo-ollama on an RPi, MLX on iOS, and MLC-LLM on Android are not interchangeable. Each is the right tool for its target.

Android's software power APIs are unreliable for precision telemetry. Out-of-band hardware measurement is the only trustworthy approach. This is worth knowing before designing the experiment, not after.

Build complexity is a real barrier to mobile ML research. The MLC-LLM Android build is a 2-4 hour process with multiple environment dependencies. This cost should be planned for explicitly.

Sustained inference is the relevant deployment scenario. Chatbots, edge assistants, and always-on applications make consecutive requests. Benchmarks that only report peak numbers are not describing the device that ships to users.

---

*Published March 2026. Project repository: conscious-engines/edge-benchmarking*