DGX Spark runs Qwen3.5-122B at 59 tok/s general and 81 on agent traffic, with speculative decoding (not NVFP4)

By LLM Requirements · June 30, 2026

#performance #nvidia #dgx-spark #speculative-decoding #qwen #agentic

On a single DGX Spark, Qwen3.5-122B-A10B with DFlash block-speculative decoding runs about 59 tok/s on general decode and about 81 tok/s on real agent and tool-call traffic. For agentic work the lever is speculative decoding, not the stock low-50s decode figure that reviews quote, and not NVFP4 (no measured win on this model).

As of June 2026.

If you are sizing up a DGX Spark to run a 120B-class model for agentic or coding work, the decode number most reviews quote, low-50s tok/s out of the box, is the wrong number to judge it on. It is the figure that makes a $4,699 box look slow, and most reviews stop there. The model people actually load on this hardware is Qwen3.5-122B-A10B (122B total parameters, about 10B active per token, a mixture-of-experts model), and on a single Spark it answers a sharper question than “how fast does it decode”: how fast does it run the agent loop you bought it for. The answer, with the right recipe, is about 59 tok/s on general single-stream decode and about 81 tok/s on real tool-call traffic. The lever that gets there is speculative decoding, not the hardware and not NVFP4.

This is the 122B half of a two-part look at speculative decoding on the DGX Spark. The other half covers DeepSeek V4 Flash and what the DSpark method really does. Same point in both: on a 273 GB/s box, the throughput you can plan around comes from how many tokens you verify per forward pass, not from a quant format.

The measured figures

All figures are single-stream on a single Spark. They come from the canonical NVIDIA developer-forum thread for this exact model (reporter entrpi, 2026-06-24) and are corroborated by our own first-party run on the box. The recipe is the Intel/Qwen3.5-122B-A10B-int4-AutoRound checkpoint (INT4 weights) served under vLLM 0.23 with the z-lab/Qwen3.5-122B-A10B-DFlash 0.8B block-diffusion drafter at speculative depth n=12, plus hybrid INT4-routed / FP8-shared experts and an int8 LM head.

Regime	Qwen3.5-122B-A10B on one DGX Spark
Plain INT4, no speculative decoding	28.2 tok/s
Public MTP-2 stack (the dataset floor)	~51 tok/s
DFlash, general single-stream decode	~59 tok/s
DFlash, real agent and tool-call traffic (~100K context)	~81 tok/s

Two numbers, two workloads, and the gap between them is the whole story. The ~59 is a neutral general-decode harness (the worst case you should plan around for free-form generation). The ~81 is real agent turns: 73 percent tool-calls, with the drafter accepting on average about 8.3 tokens per verification step. Agentic traffic is structured and predictable, which is exactly the workload speculative decoding eats for breakfast, so the number on the job goes up, not down. We carry both because publishing the 81 alone, as a bare decode figure, would be the same mistake the early write-ups made: a workload-specific peak dressed up as a general number.
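The two agent-traffic figures can be combined into an implied step rate. The sketch below only divides the quoted throughput by the quoted acceptance length. It assumes the 8.3 figure counts every token emitted per verification step, which the source does not spell out, so treat the result as a rough sanity check rather than a measured latency.

```python
# Figures quoted above for agent traffic on one DGX Spark.
AGENT_TOK_PER_S = 81.0    # ~81 tok/s on agent and tool-call traffic
ACCEPTED_PER_STEP = 8.3   # ~8.3 tokens per verification step (assumed to be
                          # all tokens emitted per step; not stated in the source)


def verification_steps_per_second(tok_per_s: float, tokens_per_step: float) -> float:
    """Steps per second implied by a throughput and a mean tokens-per-step figure."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    return tok_per_s / tokens_per_step


steps_per_s = verification_steps_per_second(AGENT_TOK_PER_S, ACCEPTED_PER_STEP)
ms_per_step = 1000.0 / steps_per_s

print(f"{steps_per_s:.2f} verification steps/s, {ms_per_step:.0f} ms per step")
```

Under that assumption the big model runs a little under ten verification passes per second, each taking on the order of 100 ms, and the 81 tok/s comes from how many tokens each pass yields.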

Why the agent number is the one that matters

A 120B-class model on a Spark is not a chatbot you watch type. It is the brain in a coding agent or a tool-using pipeline, where the model emits a tool call, reads the result, and emits the next one, over a context that grows into six figures. That is the ~81 tok/s regime, measured at about 100K context, and it is the number that decides whether the loop feels responsive. The general ~59 is the floor you fall back to on free-form prose. Most reviews never measure either with speculative decoding wired up, so the Spark gets judged on its slowest configuration doing the workload it is least suited to.

What actually moved the number

The single biggest lever is speculative decoding. Qwen3.5-122B-A10B activates only about 10B of its 122B parameters per token, so the arithmetic per token is light and, on a 273 GB/s memory bus, decode is bandwidth-bound: the box spends most of its time moving weights, not computing. Speculative decoding attacks exactly that. A small, cheap drafter guesses a block of tokens ahead, and the big model verifies them in a single forward pass, so the cost of streaming the weights is shared across every token accepted in that pass instead of being paid once per token. DFlash is a block-diffusion drafter built for this case, and at n=12 with structured agent traffic it accepts long runs (about 8.3 tokens per step), which is why the agent number clears 80.
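The draft-then-verify idea can be sketched with toy models. This is a generic greedy-acceptance loop, not DFlash and not vLLM's implementation: `toy_target`, `toy_drafter`, `speculative_step` and `generate` are hypothetical names, the "models" are simple counting rules, and the per-position loop only emulates what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]            # big model: next token for a context
DraftFn = Callable[[List[Token], int], List[Token]]  # drafter: a block of guessed tokens


def speculative_step(target: TargetFn, drafter: DraftFn,
                     context: List[Token], depth: int) -> List[Token]:
    """One draft-then-verify step with greedy acceptance."""
    draft = drafter(context, depth)
    emitted: List[Token] = []
    # A real engine scores all draft positions in ONE target forward pass;
    # this loop only emulates that check position by position.
    for i, guess in enumerate(draft):
        expected = target(context + draft[:i])
        if guess != expected:
            emitted.append(expected)  # the target's token replaces the first miss
            return emitted
        emitted.append(guess)
    emitted.append(target(context + draft))  # every guess accepted: one bonus token
    return emitted


def generate(target: TargetFn, drafter: DraftFn, prompt: List[Token],
             max_new: int, depth: int) -> Tuple[List[Token], int]:
    """Decode max_new tokens; return the sequence and the number of verify steps."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, drafter, out, depth))
        steps += 1
    return out[: len(prompt) + max_new], steps


def toy_target(context: List[Token]) -> Token:
    """Toy 'big model': always counts up by one."""
    return (context[-1] + 1) % 100


def toy_drafter(context: List[Token], depth: int) -> List[Token]:
    """Toy drafter: counts up too, but guesses wrong at every multiple of 5."""
    block, prev = [], context[-1]
    for _ in range(depth):
        nxt = (prev + 1) % 100
        if nxt % 5 == 0:
            nxt = (nxt + 1) % 100  # deliberate mistake
        block.append(nxt)
        prev = nxt
    return block


tokens, steps = generate(toy_target, toy_drafter, [0], max_new=50, depth=12)
print(f"{len(tokens) - 1} tokens in {steps} verify steps")
```

The output sequence is identical to what the target alone would produce with greedy decoding. The drafter only changes how many target passes it takes to get there, which is why a more predictable workload (longer accepted runs) raises throughput without changing the answer.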

What did not move the number is NVFP4. NVIDIA’s FP4 format gets a lot of billing because GB10 has the Blackwell tensor cores for it, but on this specific model there is no measured speculative-decoding win from it. The author of the forum benchmark says plainly he has not produced an NVFP4 build for this model that beats the INT4 plus DFlash path. Our own first-party result agrees: NVFP4 underdelivered on the 122B. The recipe that works today is INT4 weights plus a good drafter, not a quant-format swap. (NVFP4 is a real win on the smaller Qwen 3.6 35B-A3B, which is a separate result; this finding is about the 122B.)

For comparison, the same hardware with no speculative decoding at all decodes at 28.2 tok/s. The publicly shared MTP-2 stack reaches about 51 tok/s, which is the floor in our hardware dataset. DFlash takes the same box to ~59 general and ~81 on agent traffic. That is a roughly 2x to 3x swing from software, on hardware you have already paid for.
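The size of that swing is plain division on the figures above. A small sketch, using the rounded throughput numbers from this article (the dictionary labels are ours):

```python
# Single-stream throughput on one DGX Spark, tok/s, as quoted in this article.
PLAIN_INT4 = 28.2
MTP2 = 51.0
CONFIGS = {
    "Public MTP-2 stack": MTP2,
    "DFlash, general decode": 59.0,
    "DFlash, agent traffic": 81.0,
}


def speedup(tok_per_s: float, baseline: float) -> float:
    """Throughput ratio of a configuration against a baseline."""
    return tok_per_s / baseline


for name, tps in CONFIGS.items():
    print(f"{name}: {speedup(tps, PLAIN_INT4):.2f}x vs plain INT4, "
          f"{speedup(tps, MTP2):.2f}x vs MTP-2")
```

Against plain INT4 the DFlash figures work out to about 2.1x (general) and 2.9x (agent). Against the public MTP-2 stack they are about 1.16x and 1.59x.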

How the model fits the box at all

Qwen3.5-122B-A10B in BF16 is about 234 GB, far past any single 128 GB machine. A 4-bit build (the INT4 AutoRound checkpoint here) fits it into the Spark’s unified 128 GB pool with room left for a long KV cache, which is what lets the model hold a 100K-plus agent context at all. This is why it is a Spark story rather than a desktop-GPU story: a 32 GB RTX 5090 or a 48 GB workstation card cannot load a 120B-class MoE, tuned or not. You need a large memory pool, unified or multi-GPU.
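A back-of-envelope version of that fit check, scaling the BF16 size quoted above by bit width. It assumes a uniform 4 bits per weight and ignores quantization scales, the FP8 shared experts, the int8 LM head, the 0.8B drafter and runtime overhead, so the real footprint will be somewhat larger. The 119 GB usable figure is the one this article lists for the Spark.

```python
BF16_SIZE_GB = 234.0     # BF16 checkpoint size quoted in this article
SPARK_USABLE_GB = 119.0  # usable memory this article lists for the 128 GB Spark


def quantized_size_gb(bf16_size_gb: float, bits_per_weight: float) -> float:
    """Rough weight footprint after scaling 16-bit weights down to bits_per_weight."""
    return bf16_size_gb * bits_per_weight / 16.0


int4_gb = quantized_size_gb(BF16_SIZE_GB, 4)
headroom_gb = SPARK_USABLE_GB - int4_gb

print(f"INT4 weights: ~{int4_gb:.1f} GB, headroom on the Spark: ~{headroom_gb:.1f} GB")

# Memory pools named in the paragraph above, in GB.
for name, pool_gb in {"RTX 5090": 32.0, "48 GB workstation card": 48.0,
                      "DGX Spark (usable)": SPARK_USABLE_GB}.items():
    verdict = "fits" if int4_gb <= pool_gb else "does not fit"
    print(f"{name}: {verdict}")
```

On these assumptions the weights take roughly half the usable pool, and the rest is what a long KV cache can draw on.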

How it lines up against the boxes that fit it

Most of the hardware that people cross-shop against the Spark cannot load a 120B-class MoE at all. Here are the capacity builds that can. One honest caveat up front: the Spark column carries its measured speculative-decoding result because that recipe is verified today, while the AMD figures are our dataset’s llama.cpp floors without their own speculative-decoding tuning. Read the AMD numbers as floors, not ceilings.

Build	Price	Memory (usable)	Bandwidth	122B-A10B tok/s
DGX Spark 128 GB	$4,699	128 GB (119)	273 GB/s	~51 public MTP, ~59 general / ~81 agent (DFlash)
Strix Halo 128 GB	$2,799	128 GB (96)	256 GB/s	47 (dataset floor)
Dual R9700 64 GB	$3,700	64 GB (62)	640 GB/s	40 (dataset floor)
Dual W7900 96 GB	$9,200	96 GB (92)	864 GB/s	28 (dataset floor)

The point of the table is not that the Spark wins a drag race; it is that the Spark is the one box here with a public, reproducible recipe that gets a 120B-class MoE past the bandwidth wall today. A tuned Strix Halo would also rise above its 47 floor with the same kind of work, and the dual-GPU AMD builds have more bandwidth per card to give. We have not measured those tuned ceilings first-party yet. First-party R9700 numbers are next on our bench, and we will publish measured AMD figures rather than dataset floors when they land.

Reproduce it

The recipe is public. The target checkpoint is Intel/Qwen3.5-122B-A10B-int4-AutoRound (INT4 weights, bf16 KV). The drafter is z-lab/Qwen3.5-122B-A10B-DFlash, an 0.8B block-diffusion model, run at speculative depth n=12. Serve both under vLLM 0.23 on a current sm121 (GB10) image, set the context to 262K, and optionally add the dense levers (hybrid INT4-routed plus FP8-shared experts and an int8 LM head) for the last few percent. The full thread, with the exact harness and per-run numbers, is the NVIDIA developer-forum post linked in the sources. Run it on a current vLLM build, not an early-2026 snapshot, since the GB10 kernel path matured through this year.
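As a starting point, the launch line might be assembled like the sketch below. This is our guess at the shape of the command, not the forum author's exact invocation. `--max-model-len` and `--speculative-config` are existing vLLM flags, but whether the DFlash drafter also needs a `method` entry (or other options) in vLLM 0.23 is not stated here, and 262144 is our reading of "262K". The hybrid-expert and int8 LM head levers are left out. Check the forum thread for the verified command.

```python
import json
import shlex

TARGET = "Intel/Qwen3.5-122B-A10B-int4-AutoRound"  # INT4 target checkpoint
DRAFTER = "z-lab/Qwen3.5-122B-A10B-DFlash"         # 0.8B block-diffusion drafter

# Assumption: a drafter repo plus a speculative depth is enough for vLLM to
# select the right path. A "method" key may be required; its value is not
# given in this article, so it is deliberately left out.
speculative_config = {
    "model": DRAFTER,
    "num_speculative_tokens": 12,  # speculative depth n=12
}

args = [
    "vllm", "serve", TARGET,
    "--max-model-len", str(262144),  # assumption: "262K" means 262,144 tokens
    "--speculative-config", json.dumps(speculative_config),
]

# Print the command for review instead of launching it.
print(shlex.join(args))
```

Printing rather than executing keeps the sketch safe to run anywhere. Paste the output into a shell on the Spark once the flags have been checked against the thread.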

Caveats

This is a throughput report, not a capability ranking. We measured how fast the model runs on this box, not how well it reasons. Pick the model on its merits; this piece is about the speed you can expect once it is tuned.
Carry both numbers. ~59 is general single-stream decode; ~81 is real agent and tool-call traffic. The 81 is workload-specific, not a universal peak. Quoting a flat 80 as a decode number is the error this rewrite exists to correct.
Speculative decoding is the lever, not NVFP4. On this model the win comes from the DFlash drafter on INT4 weights. NVFP4 has no measured speculative-decoding advantage here yet.
Single box, single stream. These are one request at a time with the whole GPU dedicated to it. Per-stream throughput drops under concurrency, so plan capacity by your concurrency budget.
Long-context prefill is the honest ceiling. Short-context decode is strong; multi-second time-to-first-token on very large prompts is the price of running on a 273 GB/s box rather than a hosted API. The agent number above already runs at about 100K context, so it reflects real long-context behavior, not a short-prompt best case.

If you are choosing which model to run rather than which box, the model picker matches every open model against your machine’s memory and shows the builds that actually fit it, with these same measured numbers, so you can see whether a 120B-class MoE is in reach before you spend anything.

Sources

NVIDIA Developer Forums: DFlash for Qwen3.5-122B-A10B, ~80 tok/s on a single Spark (reporter entrpi, 2026-06-24): the ~59 general decode, ~81 agentic, 28.2 plain-INT4 and 51.58 MTP-2 figures, the full recipe (INT4 AutoRound plus DFlash n=12, hybrid experts, int8 LM head, vLLM 0.23, 262K context), and the note that no NVFP4 build for this model has been measured beating the INT4 plus DFlash path. Corroborated by an LLM Requirements first-party run on our own Spark.
Intel/Qwen3.5-122B-A10B-int4-AutoRound model card (the served INT4 checkpoint).
z-lab/Qwen3.5-122B-A10B-DFlash model card (the 0.8B block-diffusion drafter).
Qwen/Qwen3.5-122B-A10B base model card (122B total / ~10B active; ~234 GB BF16).
NVFP4 + Qwen 3.6 35B-A3B on DGX Spark: 97 tok/s: our companion report on the smaller 35B-A3B model, where NVFP4 is a genuine win (a separate result from this 122B finding).
DeepSeek V4 Flash on 2x DGX Spark, and what the DSpark method really does: the sibling half of this speculative-decoding pair, covering the DeepSeek model across two Sparks.
Comparison build figures (DGX Spark, Strix Halo, dual R9700, dual W7900 prices, memory, bandwidth, and dataset 120B-MoE floors): LLM Requirements dataset, prices verified 2026-06-28.