
Xiaomi MiMo V2.5 on 2x DGX Spark: 32 tok/s single, 184 tok/s at 8 concurrent, 97.8 tool-quality (measured)

MiMo V2.5 is a 310B/15B-active omnimodal MoE from Xiaomi, whose MiMo family reached ~13% of OpenRouter's token traffic in May, and its weights are MIT-licensed and open. We ran the NVFP4 build on our 2-node DGX Spark cluster: ~32 tok/s single-stream, 184 tok/s at 8 concurrent, 1M context, 97.8/100 on a 69-scenario tool eval. Two honest catches inside.

Xiaomi’s MiMo models went from nothing to about 13% of OpenRouter’s token traffic in a year, and in May 2026 MiMo-V2-Pro was the single most-used model on the platform (OpenRouter May 2026 rankings). Unlike most models near the top of that chart, the MiMo weights are MIT open. So for the omnimodal 310B sibling, MiMo V2.5, the question stops being “is it good” and becomes “can I run it myself, and how fast.” We put it on our own two-box DGX Spark cluster to find out.

MiMo V2.5 is a 310B-total, 15B-active sparse MoE (256 experts, top-8 per token), and it is natively omnimodal: text, image, video, and audio through the same model, with a 1M-token context (HF model card). The 15B active count is the whole story for local hardware. It means a 310B model decodes at the speed of a small one, while the 184 GB of NVFP4 weights need a machine with room to hold them. A single 128 GB DGX Spark cannot; two of them, pooled to 256 GB over RoCE, can.
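The fit argument is plain arithmetic. Here is a minimal sketch using only the figures quoted in this article (184 GB of weights, 128 GB nominal per box). The function name is illustrative, and the check covers weights only: KV cache and runtime overhead need headroom on top.

```python
WEIGHTS_GB = 184        # NVFP4 checkpoint size quoted in the article
NODE_MEMORY_GB = 128    # nominal unified memory per DGX Spark
TOTAL_PARAMS_B = 310    # total parameters, in billions
ACTIVE_PARAMS_B = 15    # parameters active per token, in billions


def weights_fit(nodes: int) -> bool:
    """True if the weights alone fit in the pooled memory of `nodes` boxes."""
    return WEIGHTS_GB <= nodes * NODE_MEMORY_GB


# Fraction of the model that actually runs for each decoded token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(weights_fit(1), weights_fit(2))
print(f"{active_fraction:.1%} of parameters active per token")
```

Under these numbers one box fails the check and two pass, while under 5% of the parameters are touched per token, which is why decode speed looks like that of a much smaller model.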

We served lukealonso/MiMo-V2.5-NVFP4 (NVFP4 experts + MXFP8 dense + the Omni tower, ~184 GB) on 2x DGX Spark (GB10), TP=2, vLLM, using the tonyd2wild 2-Spark recipe and its patched image. Numbers below are decode-phase, server-side (client SSE under-reports ~40% on this stack), and every one carries its regime.

Single-stream: ~32 tok/s

The tuned recipe our cluster runs settles at ~32 tok/s single-stream across output lengths, with MTP1 speculative decoding accepting ~93% of drafted tokens:

| Output tokens | Server tok/s | MTP acceptance |
|---|---|---|
| 512 | 32.13 | 0.928 |
| 1,024 | 31.87 | 0.927 |
| 2,048 | 31.89 | 0.926 |
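One way to read the acceptance figures: with a single drafted token per step (MTP1), each forward pass of the main model emits one guaranteed token plus the drafted one whenever it is accepted. The sketch below computes that expectation under the simplifying assumption that drafts are accepted independently with a fixed probability. It gives tokens per forward pass, not wall-clock speedup, since drafting and verification carry their own cost.

```python
def expected_tokens_per_step(acceptance: float, draft_tokens: int = 1) -> float:
    """Expected tokens emitted per forward pass of the main model.

    Assumes each drafted token is accepted independently with probability
    `acceptance`, and that a draft only counts if every earlier draft in
    the same step was also accepted.
    """
    expected = 1.0  # the main model always contributes one token
    survive = 1.0   # probability that all drafts so far were accepted
    for _ in range(draft_tokens):
        survive *= acceptance
        expected += survive
    return expected


for acc in (0.928, 0.927, 0.926):
    print(acc, round(expected_tokens_per_step(acc), 3))
```

At these acceptance rates each step yields a little over 1.9 tokens on average, close to the ceiling of 2 for a one-token draft.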

That is at shallow context. This is read-along pace, comfortable for chat and a single agent loop, not the sub-second-per-screen feel of a hosted API. Three independent 2-Spark measurements bracket the same band: the NVIDIA developer forum recipe thread reports 34.14 tok/s single-stream (a3refaat) and a 33.5 to 41.9 range across tasks (eparin82) on the sibling eugr build (thread #370459, May 18), and a separate step-by-step dredyson.com writeup lands at ~32 to 33. Our own first pass on the cluster measured a bit lower, ~28 tok/s, before we matched the recipe’s fully-tuned config; the gap was tuning, not hardware.

8 concurrent: 184 tok/s aggregate (23 per stream)

Send more requests at once and the shared batch fills up. Aggregate throughput climbs while per-stream speed drops, the normal MoE-serving trade:

| Concurrency | Aggregate tok/s | Per-stream tok/s | Acceptance |
|---|---|---|---|
| 2 | 60.2 | 30.1 | 0.829 |
| 4 | 94.7 | 23.7 | 0.837 |
| 6 | 141.8 | 23.6 | 0.832 |
| 8 | 184.1 | 23.0 | 0.867 |

The 184 tok/s at 8 concurrent is shared-batch aggregate, not per-user speed. If you are running one agent, plan around ~32 tok/s; if you are fanning out eight, plan around ~23 each. These are short-to-moderate requests on a 1M-capable server, not eight simultaneous full-1M streams (the memory pool holds roughly two of those).
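The per-stream figures are just the aggregate divided by the number of concurrent requests. A small sketch with the measured aggregates makes the distinction concrete; the helper names are illustrative, and the reply-time estimate assumes a steady decode rate and ignores prefill.

```python
# Measured shared-batch aggregates from this article: concurrency -> tok/s.
AGGREGATE_TOK_S = {2: 60.2, 4: 94.7, 6: 141.8, 8: 184.1}


def per_stream(aggregate: float, concurrency: int) -> float:
    """Speed each individual request sees when the batch is shared evenly."""
    return aggregate / concurrency


def seconds_for_reply(tokens: int, tok_per_s: float) -> float:
    """Decode-only time for a reply of `tokens` tokens at a steady rate."""
    return tokens / tok_per_s


for n, agg in AGGREGATE_TOK_S.items():
    rate = per_stream(agg, n)
    print(f"{n} concurrent: {rate:.2f} tok/s each, "
          f"{seconds_for_reply(1000, rate):.0f} s per 1,000-token reply")
```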

Quality: 97.8 on a 69-scenario tool eval, thinking off

Speed is only half of “can I use this.” On a 69-scenario tool-calling eval the NVFP4 build scored 97.8/100 with thinking off (66 pass, 3 partial, 0 fail), and, notably, the score was identical at 500K and at 1M context. The 1M ceiling costs nothing on quality.
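The headline score is consistent with a simple weighting in which a pass earns full credit and a partial earns half. That half-credit weighting is an assumption for illustration, not the eval's documented rubric, but it reproduces the reported figure:

```python
def tool_eval_score(passed: int, partial: int, failed: int,
                    partial_credit: float = 0.5) -> float:
    """Score out of 100, assuming partials earn `partial_credit` of a pass.

    The 0.5 default is an illustrative assumption, not a documented rubric.
    """
    total = passed + partial + failed
    return 100.0 * (passed + partial_credit * partial) / total


# Thinking-off run: 66 pass, 3 partial, 0 fail across 69 scenarios.
print(round(tool_eval_score(66, 3, 0), 1))  # 97.8
```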

Turning thinking on made it worse for tool work: 90.6/100 with 5 outright failures, at roughly twice the end-to-end latency from the extra reasoning tokens. For agents and tool loops, run MiMo thinking-off. Save thinking-on for open-ended reasoning where the extra tokens earn their keep. On coding and agentic benchmarks the model is strong in its class: Terminal-Bench 2.0 65.8 and SWE-Bench Pro 56.1 (from the model-card charts, cross-checked in our dataset), which is why it lands where it does on the agentic leaderboards despite the small active-parameter count.

The two honest catches

Depth. The ~32 tok/s figure is at shallow context. Push the context deep and single-stream decode falls off hard: on our cluster it dropped to ~13 tok/s at 100K depth. The 1M context is real and it boots (the KV pool holds ~2.17 million tokens), but treat deep-context single-stream as a slow-and-steady mode, not the headline speed.

It is a light-batch driver, not a fleet server. The 2.17M-token pool fits about two full-1M requests, or around twenty concurrent 100K-context agents, before KV memory bites. That makes a 2-Spark MiMo a fast single-user or small-team agent box, not a many-simultaneous-full-context production server. For that you want more nodes: a 3-Spark TP=3 setup reaches ~39 tok/s single-stream (forum #373968), and the payoff there is context and concurrency headroom, not a big single-stream jump.
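The capacity claims follow from dividing the KV pool by the context each request holds. Here is a rough sketch, assuming every request fills its whole context and nothing is shared between requests; the function name is illustrative.

```python
KV_POOL_TOKENS = 2_170_000  # ~2.17M-token KV pool reported in this article


def max_full_requests(context_tokens: int,
                      pool_tokens: int = KV_POOL_TOKENS) -> int:
    """How many requests of `context_tokens` fit in the KV pool at once."""
    return pool_tokens // context_tokens


print(max_full_requests(1_000_000))  # 2
print(max_full_requests(100_000))    # 21
```

This is KV capacity only. The serve flags in the recipe also cap concurrent sequences with `--max-num-seqs`, so the scheduler limit can bind before memory does.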

The recipe

Stock vLLM cannot run this. MiMo V2.5 on GB10 needs a patched dev build plus loader mods (NVFP4 KV cache, a Triton DiffKV attention backend, the MiMoV2Omni architecture, an MXFP8 mixed-precision fix). The cleanest path is the published tonyd2wild image (ghcr.io/tonyd2wild/mimo-v2.5-tp2-1m-nvfp4kv:20260620, ~20 GB) with its six runtime mods applied on both nodes. Core serve flags:

--tensor-parallel-size 2 --distributed-executor-backend ray \
--kv-cache-dtype nvfp4 --attention-backend triton_attn_diffkv \
--max-model-len 1000000 --max-num-seqs 8 \
--gpu-memory-utilization 0.84 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--load-format safetensors \
--hf-overrides '{"architectures":["MiMoV2OmniForCausalLM"]}' \
--tool-call-parser mimo --reasoning-parser mimo --enable-auto-tool-choice

A few things that cost round-trips if you skip them:

  • Ray is required. vLLM’s multiprocessing executor is single-host; a 2-box tensor-parallel split needs Ray. Start the worker container and Ray first, then the head, then launch vLLM from the head once Ray sees both GPUs. Cap the Ray object store at 1 GiB per node or it steals unified memory and OOMs on weight load.
  • Pin the host IP to each node’s RoCE address. If Ray binds a link-local 169.254.x.x interface, the cluster freezes.
  • MTP1 beats MTP2 here. MTP2 gives no decode gain and nearly halves the KV pool. Use MTP1.
  • safetensors, not InstantTensor. The MTP plus NVFP4-KV path wedges on the drafter load under InstantTensor.
  • GPU memory utilization tops out at 0.89. 0.90 trips the GB10 startup guard.
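The first two items can be guarded in a small launcher script. The sketch below builds `ray start` arguments with the 1 GiB object-store cap and refuses link-local addresses before they reach Ray. The helper names and the example IPs are hypothetical; the `ray start` flags used (`--head`, `--port`, `--address`, `--node-ip-address`, `--object-store-memory`) are standard Ray CLI options. It only builds the command lines and does not launch anything.

```python
import ipaddress
from typing import List, Optional

OBJECT_STORE_BYTES = 1 * 1024**3  # 1 GiB cap per node, as advised above


def check_node_ip(ip: str) -> str:
    """Reject link-local addresses (169.254.0.0/16) before Ray binds them."""
    if ipaddress.ip_address(ip).is_link_local:
        raise ValueError(f"{ip} is link-local; use the node's RoCE address")
    return ip


def ray_start_args(node_ip: str, head_ip: Optional[str] = None,
                   port: int = 6379) -> List[str]:
    """Build `ray start` arguments for the head (head_ip=None) or a worker."""
    args = [
        "ray", "start",
        f"--node-ip-address={check_node_ip(node_ip)}",
        f"--object-store-memory={OBJECT_STORE_BYTES}",
    ]
    if head_ip is None:
        args += ["--head", f"--port={port}"]
    else:
        args.append(f"--address={check_node_ip(head_ip)}:{port}")
    return args


# Hypothetical RoCE addresses, for illustration only.
print(" ".join(ray_start_args("192.168.100.10")))
print(" ".join(ray_start_args("192.168.100.11", head_ip="192.168.100.10")))
```

Failing fast on a 169.254.x.x address is cheaper than diagnosing a frozen cluster after both containers are up.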

Why it matters for a GB10 buyer

The pitch of a 2-box GB10 cluster is that a frontier-class open model runs on hardware you own, at your desk, for a one-time cost. MiMo V2.5 is a real test of that pitch, and it mostly passes: ~32 tok/s of a 97.8-quality omnimodal agent, offline, MIT-licensed, with the option to push context to 1M.

The cost is where the decision actually turns. A DGX Spark Founders Edition is $4,699 per box (128 GB, 119 usable, 273 GB/s), so a two-node cluster runs about $9,400 (prices verified 2026-06-27). If you care about the GB10 chip rather than the NVIDIA badge, the same silicon sits in cheaper boxes, and an ASUS Ascent GX10 at ~$3,499 changes the math on a two-box build. Whether ~32 tok/s single-stream is enough depends entirely on your workload: plenty for one coding agent or a chat assistant, tight for a team hammering it in parallel. If you want to see which open models actually fit a given machine and how fast they should run on it, our picker matches every model against your exact memory and bandwidth and shows the expected speed, and the DGX Spark hardware page has the full per-model breakdown and where to buy.

For the other frontier MoE we run on the same box, see our DeepSeek V4-Flash on 2x DGX Spark measurements, and for the NVFP4 speed story on a single Spark, Qwen 3.6 35B-A3B at 97 tok/s.

Sources