Strix Halo just got up to 2.4x faster on Qwen3.6, from a llama.cpp update and no new hardware

By LLM Requirements · July 1, 2026

#performance #amd #strix-halo #speculative-decoding #llama-cpp #qwen

A llama.cpp update (PR #22673, merged 2026-05-16) added built-in multi-token-prediction speculative decoding. On a Strix Halo mini-PC it lifts single-stream Qwen3.6 decode about 1.4x on the 35B-A3B MoE and about 1.8x to 2.4x on the 27B dense model, a software-only gain on a box you already own.

As of July 2026.

If you own a Strix Halo box, the one that runs Qwen3.6 at around 12 tok/s on the dense 27B or in the low fifties on the 35B-A3B, it got faster in May and you did not have to buy anything. A llama.cpp update merged on May 16 added built-in speculative decoding for the models people actually run on these machines. On the same 128 GB mini-PC it lifts single-stream decode by roughly 1.4x on the 35B-A3B mixture-of-experts model and by about 1.8x to 2.4x on the dense 27B, depending on the quant. No new silicon, no cluster, just a newer build of the same open engine.

This is the non-NVIDIA half of a story we have been telling about the DGX Spark: on memory-bandwidth-bound hardware, the throughput you can plan around comes from how many tokens you verify per forward pass, not from the box. The Spark gets there with a custom drafter. Strix Halo now gets there with a feature that ships in mainline llama.cpp.

What changed

The update is llama.cpp PR #22673, merged May 16, 2026. It adds first-class support for multi-token prediction (MTP), a form of speculative decoding. The idea is simple: some models, including Qwen3.6, ship with a small extra head trained to guess the next few tokens. That head drafts a short run of tokens cheaply, then the full model verifies the whole run in a single forward pass. When the guesses are right, you get several tokens for the memory cost of one, so a bandwidth-starved box does less work per token it keeps.
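A minimal sketch of that draft-and-verify loop, in Python, with toy stand-ins for both models. The names speculative_step, generate, toy_target, and toy_draft are hypothetical and are not llama.cpp's API. Greedy decoding is assumed, and the Python loop in the verify step only mimics the result of what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "model" is any function that maps a token
# prefix to the next token it would pick greedily.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], n_draft: int = 3) -> List[int]:
    """Return the tokens kept from one draft-and-verify round."""
    # 1. The cheap head drafts a short run of tokens.
    drafted: List[int] = []
    for _ in range(n_draft):
        drafted.append(draft(prefix + drafted))

    # 2. The full model checks each drafted position. A real engine scores
    #    all positions in a single batched pass; this loop mimics the result.
    kept: List[int] = []
    for guess in drafted:
        truth = target(prefix + kept)
        kept.append(truth)       # the full model's token is always kept
        if truth != guess:       # the first mismatch ends the round
            return kept

    # 3. Every guess matched, so the same pass also yields one bonus token.
    kept.append(target(prefix + kept))
    return kept


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, n_draft: int = 3) -> Tuple[List[int], int]:
    """Generate n_tokens and count how many verify rounds it took."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, n_draft))
        rounds += 1
    return out[len(prompt):len(prompt) + n_tokens], rounds


# Toy models: the target counts upward modulo 10. The draft agrees with it
# except right after a 5, where it guesses wrong.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10


tokens, rounds = generate(toy_target, toy_draft, [0], 12)
print(tokens, rounds)  # 12 tokens from 4 verify rounds in this toy setup
```

The output is identical to plain greedy decoding from the full model; only the number of full-model passes changes. A wrong guess costs the rest of that round's drafts, which is why the acceptance rate drives the speedup.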

Across the hardware tested in the PR, the measured result is about a 2x decode speedup with a draft-acceptance rate near 75 percent at three draft tokens, and the extra MTP head adds under 10 percent to memory use. The feature is not Strix Halo specific: it was tested on NVIDIA CUDA, AMD ROCm (including the Radeon AI PRO R9700), and Apple Metal. Strix Halo is just where the gain lands hardest, and the measurements below show why.
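A rough way to connect those two PR figures, as a sketch rather than a model of llama.cpp's internals. It assumes "acceptance rate" means the share of drafted tokens that are accepted, and it treats the cost of one draft-and-verify round as an unknown to solve for, since the PR summary quoted here does not report it. The function names are hypothetical.

```python
def tokens_per_pass(acceptance: float, n_draft: int) -> float:
    """Average tokens kept per verify pass: the accepted drafts plus the
    one token the full model always contributes."""
    return 1.0 + acceptance * n_draft


def implied_pass_cost(tokens: float, speedup: float) -> float:
    """Cost of one draft-and-verify round, in units of a plain decode step,
    that would reconcile `tokens` per pass with an observed `speedup`."""
    return tokens / speedup


kept = tokens_per_pass(0.75, 3)        # 3.25 tokens per pass
cost = implied_pass_cost(kept, 2.0)    # 1.625 plain-decode steps per round
print(kept, cost)
```

Under that reading, each round keeps about 3.25 tokens and costs about 1.6 plain decode steps, which is how 75 percent acceptance turns into roughly 2x rather than 3x. A lower acceptance rate shrinks the first number while the cost of the round stays put.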

The measured figures on Strix Halo

These are from a hands-on run on a Framework Desktop (Ryzen AI Max+ 395, Radeon 8060S integrated GPU, 128 GB LPDDR5X) published two days after the merge, on a post-merge build with ROCm 7.2.3. All runs are single-stream, short-context chat, with MTP set to three draft tokens.

Model	Quant	Before (no MTP)	After (MTP)	Speedup
Qwen3.6 35B-A3B (MoE)	Q4_K_M	~53 tok/s	~69 tok/s	1.40x
Qwen3.6 27B (dense)	Q4_K_M	~12 tok/s	~21 tok/s	1.81x
Qwen3.6 27B (dense)	Q8_0	~7.7 tok/s	~18 tok/s	2.44x

The pattern is worth reading, because it is the opposite of the usual intuition. The dense 27B gains the most, and the larger 35B MoE gains the least. That is not a mistake. The MoE only activates about 3B of its 35B parameters per token, so its baseline is already quick and there is less bandwidth pressure for speculative decoding to relieve. The dense 27B reads all 27B parameters for every token, which is the real bottleneck on a 256 GB/s box (closer to 215 GB/s in practice), so cutting the memory traffic per token helps it most. The Q8_0 row shows the ceiling: a heavier quant is more bandwidth-bound, so MTP more than halves its time per token.
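A back-of-envelope version of that argument, as a sketch under stated assumptions. It uses the article's approximate Q4 sizes (about 15 GB for the 27B and about 22 GB for the 35B-A3B) and the roughly 215 GB/s practical bandwidth. It assumes the MoE reads weights in proportion to its active parameters (3B of 35B), and it ignores KV-cache traffic and shared layers, so the ceilings are rough upper bounds and not measurements.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if every token must stream
    `gb_read_per_token` of weights from memory."""
    return bandwidth_gb_s / gb_read_per_token


PRACTICAL_BW_GB_S = 215.0            # article's practical figure for Strix Halo

dense_gb_per_token = 15.0            # dense 27B Q4: all weights read per token
moe_gb_per_token = 22.0 * 3 / 35     # assumption: reads scale with active params

dense_ceiling = decode_ceiling_tok_s(PRACTICAL_BW_GB_S, dense_gb_per_token)
moe_ceiling = decode_ceiling_tok_s(PRACTICAL_BW_GB_S, moe_gb_per_token)

# Measured no-MTP baselines from the table above.
dense_share = 12.0 / dense_ceiling
moe_share = 53.0 / moe_ceiling

print(f"dense 27B: ceiling ~{dense_ceiling:.1f} tok/s, baseline at {dense_share:.0%} of it")
print(f"35B-A3B:   ceiling ~{moe_ceiling:.0f} tok/s, baseline at {moe_share:.0%} of it")
```

On these assumptions the dense model already sits near its bandwidth ceiling, so more tokens per weight read is the main route left to speed it up. The MoE baseline sits well below its rough ceiling, which suggests something other than weight bandwidth is limiting it and fits its smaller MTP gain.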

The same run notes that Strix Halo gets a bigger relative speedup from MTP than a discrete card like a 3090, precisely because the integrated GPU already saturates its memory ceiling. Speculative decoding gives back the most where bandwidth is scarcest.

The number to be careful with

You will see a “4.8x faster” figure attached to this exact combination. It comes from an earlier write-up (May 6, before the merge) that measured a jump from 6.20 tok/s to 29.79 tok/s on the 27B. Read the setup before you quote it: that baseline used a quantized q4_0 KV cache on a pre-merge branch, and 6.20 tok/s is about half of what a 27B dense Q4 should decode at on this box. A hobbled baseline makes any speedup ratio look bigger. The reproducible range on a current build is the 1.4x to 2.4x above. We flag this because the gap between “1.8x” and “4.8x” is exactly the kind of discrepancy that ends up on a spec sheet and then does not reproduce.
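The arithmetic behind that caution, as a short sketch. The second ratio is illustrative only: it divides the earlier write-up's MTP result by the roughly 12 tok/s baseline from the later post-merge run, which used a different build and KV-cache setting, so it is not a measured speedup.

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Speedup ratio between two decode rates."""
    return after_tok_s / before_tok_s


headline = speedup(6.20, 29.79)       # the write-up's own baseline
rebased = speedup(12.0, 29.79)        # same result against a ~12 tok/s baseline

print(f"{headline:.2f}x vs {rebased:.2f}x")  # 4.80x vs 2.48x
```

Halving the denominator roughly doubles the ratio without making the box any faster.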

What it does not change

Speculative decoding raises throughput; it does not change what the model knows or how much of it fits. Qwen3.6 27B in Q4 sits around 15 GB, and the 35B-A3B in Q4 around 22 GB, so both fit comfortably in 128 GB with room for a long context. MTP does not shrink the model, and it does not help you run something that did not fit before. It makes what already runs, run faster.
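A quick fit check with the figures above, as a sketch. It assumes the PR's "under 10 percent" overhead applies to the weight size and uses it as an upper bound. It ignores the KV cache and whatever share of the 128 GB the OS keeps for itself, so the headroom shown is optimistic.

```python
TOTAL_GB = 128.0
MTP_OVERHEAD = 0.10   # upper bound taken from the PR's "under 10 percent"

# Approximate Q4 weight sizes quoted in the article.
models = {"Qwen3.6 27B Q4": 15.0, "Qwen3.6 35B-A3B Q4": 22.0}


def footprint_with_mtp(weights_gb: float, overhead: float = MTP_OVERHEAD) -> float:
    """Weights plus the extra MTP head, using the overhead as an upper bound."""
    return weights_gb * (1.0 + overhead)


for name, gb in models.items():
    need = footprint_with_mtp(gb)
    print(f"{name}: up to ~{need:.1f} GB with the MTP head, ~{TOTAL_GB - need:.0f} GB left")
```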

Two more limits. These are single-stream numbers, one request at a time with the whole GPU on it; under concurrency, per-stream throughput falls, because batching and speculative decoding compete for the same headroom (the MoE row in particular gives back most of its gain once you serve several users at once). And the acceptance rate depends on the workload: predictable, structured text (code, tool calls) accepts more draft tokens than free-form prose, so your mileage tracks what you actually generate.

Why this matters for what to buy

The takeaway is not “buy a Strix Halo.” It is that a mid-range unified-memory box people already own quietly moved up a tier on the models they run, from an open-source update, and that keeps happening across this hardware class. The same PR benefits the Radeon AI PRO R9700 and Apple Silicon through Metal. When you are comparing a 256 GB/s mini-PC against a discrete GPU or a Spark, the software floor is not fixed, and the boxes that lean hardest on memory bandwidth are the ones with the most left to gain from the next merge.

If you are trying to work out which model your own machine can run, and how fast, the model picker matches every open model against your memory and shows the builds that actually fit, so you can see whether a 27B dense or a 35B-A3B MoE is in reach on your box before you change anything. For the hardware itself, the Strix Halo 128 GB page carries the current price, memory, and bandwidth.

Sources

llama.cpp PR #22673 (merged 2026-05-16): adds multi-token-prediction speculative decoding; ~2x decode, ~75 percent draft acceptance at three draft tokens, under 10 percent extra memory; tested on CUDA, ROCm (incl. R9700), and Metal; supports Qwen3.6-27B dense and Qwen3.6-35B-A3B MoE. Primary source for the feature and the cross-hardware headline.
Caleb Coffie, “Benchmarking llama.cpp’s brand-new MTP support on Strix Halo” (2026-05-18): the reproducible post-merge run on a Framework Desktop (Ryzen AI Max+ 395, Radeon 8060S, 128 GB, ROCm 7.2.3). Source for the 35B-A3B 1.40x, 27B Q4 1.81x, and 27B Q8_0 2.44x figures and the bandwidth-ceiling explanation.
kyuz0 Strix Halo MTP benchmark grid: the tracking grid of record for Strix Halo MTP results (our per-topic authority for this hardware).
Outlier flagged, not used as a headline: Sleeping Robots, “MTP Speculative Decoding: 4.8x Faster Qwen 3.6 27B on Strix Halo” (2026-05-06). Its 4.8x rests on a hobbled baseline (quantized q4_0 KV cache, pre-merge branch); the reproducible range on a current build is 1.4x to 2.4x.
Strix Halo 128 GB price, memory, and bandwidth (256 GB/s): LLM Requirements dataset, verified 2026-07-01.
Companion report on the same lever on NVIDIA hardware: DGX Spark runs Qwen3.5-122B at 59 tok/s general and 81 on agent traffic with speculative decoding.