Best AI models that run on NVIDIA RTX Spark (128 GB)

Coming soon — DGX Spark silicon in Windows laptops and desktops, announced at Computex 2026.

NVIDIA · OEM laptops + small desktops

NVIDIA RTX Spark (128 GB)

128 GB memory · 119 GB usable · 300 GB/s memory bandwidth · —

Coming soon — not shipping yet. The specs above are NVIDIA's announced figures, not measured numbers. OEM systems are expected later in 2026; we'll fill in pricing, retailers, and real tokens-per-second benchmarks as review units land. Until then, the DGX Spark — built on the same GB10-class silicon — is the closest shipping proxy for what this platform can run.

What models fit this build

NVIDIA RTX Spark (128 GB) has 119 GB of usable memory. Here's which open-weights model sizes fit at each quant, at ~8K context with an FP8 KV cache.

Largest comfortable fit: ~235B MoE at Q2 (~100 GB).

Model size	Q2	Q4	Q5	Q8
7–8B	Fits	Fits	Fits	Fits
13–14B	Fits	Fits	Fits	Fits
30–32B	Fits	Fits	Fits	Fits
70–72B	Fits	Fits	Fits	Fits
~120B MoE	Fits	Fits	Fits	Won't fit
~235B MoE	Fits	Won't fit	Won't fit	Won't fit
~670B MoE	Won't fit	Won't fit	Won't fit	Won't fit
1T+ MoE	Won't fit	Won't fit	Won't fit	Won't fit

Fits = the weights + FP8 KV cache fit within this build's usable memory at ~8K context. Longer context needs more memory.
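The fit check behind the table can be approximated with a few lines of arithmetic. The sketch below is illustrative only: the effective bits-per-weight values are assumptions (the Q2 figure is back-derived from the ~100 GB quoted above for a ~235B model), and the layer count, KV-head count, and head dimension in the usage lines are hypothetical placeholders, not the shape of any specific model.

```python
# Assumed effective bits per weight for each quant level (illustrative values,
# not published by the page; Q2 is back-derived from "~235B at Q2 is ~100 GB").
BITS_PER_WEIGHT = {"Q2": 3.4, "Q4": 4.5, "Q5": 5.5, "Q8": 8.5}

USABLE_GB = 119.0  # usable memory quoted for this build


def weights_gb(params_billions: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given quant level."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8.0


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 1) -> float:
    """Approximate KV cache size in GB; FP8 means 1 byte per stored value."""
    # Factor of 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9


def fits(params_billions: float, quant: str, kv_gb: float,
         usable_gb: float = USABLE_GB) -> bool:
    """True when weights plus KV cache stay within usable memory."""
    return weights_gb(params_billions, quant) + kv_gb <= usable_gb


# Hypothetical model shape: 80 layers, 8 KV heads, head dimension 128, 8K context.
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=8192)
print(fits(72, "Q8", kv))   # 70-72B at Q8
print(fits(235, "Q4", kv))  # ~235B MoE at Q4
```

At 8K context the KV cache is small next to the weights, so quant level dominates the result. Rerunning `kv_cache_gb` with a larger `context_tokens` shows how quickly that changes at long context.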

Closed-frontier reference (May 2026)

What "API-grade" actually scores right now. Use this as the ceiling: anything local will trail it by some margin, and that's fine for most workflows.

Google DeepMind 2026-02

Gemini 3.1 Pro

2.0M ctx · $2 / $12 per M tok

API: 125 t/s out · +30 s thinking

HLE: 44.7%
Terminal-Bench 2: 80.2%
SWE-Bench Pro: 54.2%
SWE-Bench Verified: 80.6%
Aider Polyglot: —
LiveCodeBench: 91.7%
GPQA Diamond: 94.3%
MMLU-Pro: 91.0%

Coding: Strong at coding, especially repo-level work that uses the 2M context. Less of a Cursor/Cline default than Claude, though Gemini Code Assist users prefer it. Tops the leaderboards on GPQA and LCB. Now #7 on OpenRouter coding by volume (120B tokens); Gemini 3.5 Flash is expected to displace it next month.

Agent: TB2 80.2 makes it agent-grade. Used widely on Google's Agent Builder; less common in Open-Claude/Hermes. Reliable in long Vertex-AI Agent runs (multi-hour).

OpenAI 2026-04

ChatGPT 5.5

922K ctx · $5 / $30 per M tok

API: 61 t/s out · +28 s thinking

HLE: 52.2%
Terminal-Bench 2: 82.0%
SWE-Bench Pro: 58.6%
SWE-Bench Verified: 88.7%
Aider Polyglot: 88.0%
LiveCodeBench: —
GPQA Diamond: 93.6%
MMLU-Pro: —

Coding: Codex CLI + GPT-5.5 tops Terminal-Bench. r/ChatGPTCoding has shifted to it for daily coding; Cursor users are mixed (some prefer Claude Sonnet 4.6 for diff quality).

Agent: Strongest published agent score (TB2 82.0%, re-verified 2026-05-14). Widely used in OpenAI Assistants, AutoGPT-style swarms, and Open-Claude routing. Reliable in 4-8h autonomous sessions.

Anthropic 2026-06

Claude Sonnet 5

1.0M ctx · $3 / $15 per M tok

HLE: 57.4%
Terminal-Bench 2: 80.4%
SWE-Bench Pro: 63.2%
SWE-Bench Verified: —
Aider Polyglot: —
LiveCodeBench: —
GPQA Diamond: —
MMLU-Pro: —

Coding: Anthropic's 2026-06-30 release: near-Opus intelligence at the old Sonnet price. Terminal-Bench 2.1 jumps to 80.4 (from Sonnet 4.6's 53.4), putting it in the top tier of agentic coders next to GPT-5.5 (82) and Gemini 3.1 Pro (80.2). 1M context, adaptive thinking on by default; the daily workhorse for Cursor / Zed / Cline users who want closed-model diff quality.

Agent: TB2 80.4 makes Sonnet 5 a top-tier agent driver (Sonnet 4.6 was mid-tier at 53.4). SWE-Bench Pro 63.2 lands between GPT-5.5 (58.6) and Opus 4.8 (69.2). Anthropic's tool-use SDK keeps it the most reliable closed model for hand-rolled agent loops.

Anthropic 2026-05

Claude Opus 4.8

1.0M ctx · $5 / $25 per M tok

API: 59 t/s out · +22 s thinking

HLE: 57.9%
Terminal-Bench 2: —
SWE-Bench Pro: 69.2%
SWE-Bench Verified: 88.6%
Aider Polyglot: —
LiveCodeBench: —
GPQA Diamond: 93.6%
MMLU-Pro: —

Coding: Cursor / Cline / Zed power-user pick when budget allows. SWE-Pro 69.2 and TB2.1 74.6 say it all — best closed model for real software engineering. Simon Willison describes it as 'a modest but tangible improvement' over Opus 4.7 (simonwillison.net/2026/May/28/claude-opus-4-8/). 4x less likely to miss code flaws vs predecessor.

Agent: Top closed agent with SWE-Pro 69.2 and new 'dynamic workflow' tooling (Claude Code 2.1.154). Powers production Hermes / Open-Claude setups. TB2.1 74.6 per Anthropic self-report — TB2.0 leaderboard not yet updated. Fast mode at $10/$50 reduces agentic cost vs standard rate.
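The price and throughput figures on the cards above can be turned into rough per-request estimates. This is a minimal sketch under two assumptions the page does not spell out: that a pair such as "$2 / $12 per M tok" means input price / output price per million tokens, and that the "+N s thinking" delay is paid once per response. The function names are hypothetical.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated API cost of one request, with prices given per million tokens."""
    return (input_tokens / 1e6 * input_price_per_m
            + output_tokens / 1e6 * output_price_per_m)


def response_seconds(output_tokens: int, tokens_per_second: float,
                     thinking_seconds: float = 0.0) -> float:
    """Estimated wall-clock time: one thinking delay plus streaming the output."""
    return thinking_seconds + output_tokens / tokens_per_second


# Example with the Gemini 3.1 Pro card: $2 / $12 per M tok, 125 t/s, +30 s thinking.
cost = request_cost_usd(10_000, 2_000, input_price_per_m=2, output_price_per_m=12)
latency = response_seconds(2_000, tokens_per_second=125, thinking_seconds=30)
print(f"~${cost:.3f} per request, ~{latency:.0f} s to finish")
```

Treat the result as an order-of-magnitude guide; real bills also depend on caching, thinking-token accounting, and tiered pricing, none of which are modelled here.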

Our picks for this build

Sourced from the State of Local AI snapshot — the model + quant + backend we'd actually deploy on this hardware today, with the recipe in the setup guide below.

Best dense

Qwen 3.6 27B (dense)

27 B · Apache 2.0

Apr 22 2026. Dense 27B that hits 77.2% SWE-Bench Verified — beats much larger MoEs on coding. Vision-capable, 262 K native context. Best single-24 GB-card coder right now.

≥20 GB Q4

HLE: 24.0%
TB2: 59.3%
SWE-Pro: 53.5%
SWE-Ver: 77.2%

Coding: The new local-coding king under 200B on r/LocalLLaMA — matches Claude Opus 4.5 on TB2 per Qwen's launch claims, beats Qwen3.5-397B-A17B on every coding eval. Daily-driver pick for Cline at Q4_K_M on a single Pro 6000 or M3 Ultra. Confirmed running ~160 tok/s with MTP on RTX 6000 per dzombak.com vLLM recipe.

Agent: Genuinely useful in Open-Claude / Claude Code routing; community reports describe 30-min+ sessions completing without derailing. Still trails the closed frontier on the very longest loops. Its agent rating is capped at 3 under the site rule for sub-200B models whose TB2 score (59.3 here) falls below the 65% threshold.

Best MoE that fits

Qwen 3.6 35B-A3B (MoE)

35 B · 3B active · Apache 2.0

Apr 2026 release. 35B / 3B active MoE — beats Gemma 4-31B on agentic coding, matches Sonnet on most vision tasks. Native 262 K context (extensible to 1 M), ~18 GB at Q4. The new local-coding king under 200 B.

≥22 GB Q4

HLE: 21.4%
TB2: 51.5%
SWE-Pro: 49.5%
SWE-Ver: 73.4%

Coding: r/LocalLLaMA's pick for fast local coding on a 24 GB card at Q4_K_M; with only 3B active parameters it's snappy. Vibe-codes 'perfectly fine' in OpenCode/Claude Code per multiple weekly megathreads. Simon Willison's pelican test (April 2026), 'Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7', is still resonating in the community.

Agent: Solid in 5-15 tool-hop loops in Cline. Long-horizon (60+ min) Open-Claude sessions still lose thread — 3B active is a ceiling on planning. Note: Qwen-self-reported TB2 51.5 vs community 23-24% — gap is harness-driven (Terminus-2 vs little-coder agent).

Dense runner-up

Mistral Medium 3.5 128B

128 B · Modified MIT

Apr 30 2026. Western 128B dense with vision + 256 K context. 77.6% SWE-Bench Verified; first credible mid-tier open-weight from Mistral in months. Modified MIT.

≥80 GB Q4

SWE-Ver: 77.6%

Coding: Launched with a built-in PR-opening coding agent. Early r/LocalLLaMA reports treat it as a credible Cline driver, but one that trails Qwen 3.6-27B on real refactors.

Agent: Mistral's agent SDK is OK; in Open-Claude it handles ~20-min sessions reliably. Long-horizon ceiling still unclear pending community evals.

Every open-weights model that fits, ranked by composite score

Composite blends benchmark averages (60%) with editorial 0-5 ratings (40%). Closed-frontier references are listed at the end of the table for comparison and carry no composite score.
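As a rough illustration of the 60/40 blend, the sketch below averages whichever benchmark scores are available and mixes that with the editorial rating. It is an assumption-laden sketch, not the site's actual formula: the normalization to a 0-1 range, the handling of missing benchmarks, and the function name are all hypothetical, and the published scores are on a different scale (for example 4920) whose scaling is not documented here.

```python
from typing import Mapping, Optional


def composite(benchmarks: Mapping[str, Optional[float]], editorial_rating: float,
              bench_weight: float = 0.6, editorial_weight: float = 0.4) -> float:
    """Blend the mean of available benchmark percentages with a 0-5 editorial rating.

    Returns a value in [0, 1]. Missing benchmarks (None) are skipped, which is
    an assumption about how gaps in the table are treated.
    """
    scores = [s for s in benchmarks.values() if s is not None]
    if not scores:
        raise ValueError("at least one benchmark score is required")
    bench_avg = sum(scores) / len(scores) / 100.0  # percentages to 0-1
    editorial = editorial_rating / 5.0             # 0-5 rating to 0-1
    return bench_weight * bench_avg + editorial_weight * editorial


# Hypothetical input: one published benchmark and an illustrative rating of 4.
score = composite({"SWE-Ver": 77.6, "HLE": None}, editorial_rating=4)
print(round(score, 4))
```

One consequence of averaging only the available benchmarks is that a model with a single strong score can outrank one with many mixed scores, which may be worth keeping in mind when reading the ranking.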

Model	Details	tg/s	pp	TTFT @ 100K	HLE	TB2	SWE-Pro	SWE-Ver	Aider	LCB	GPQA	MMLU-Pro	Score
Qwen 3 Next 80B-A3B (MoE)	80 B · 3 B active · moe	—	—	—	—	—	—	—	—	78.4%	77.2%	82.7%	4920
Mistral Medium 3.5 128B	128 B · dense	—	—	—	—	—	—	77.6%	—	—	—	—	4689
Qwen 3.5 9B	9 B · dense	—	—	—	—	—	—	—	—	65.6%	81.7%	82.5%	4623
GLM-4.5-Air 106B (MoE)	106 B · 12 B active · moe	—	—	—	—	—	—	57.6%	—	—	—	—	4488
DiffusionGemma 26B-A4B	26 B · 4 B active · diffusion-moe	—	—	—	—	—	—	—	—	69.1%	73.2%	77.6%	4358
Mistral Small 4 119B-A6B (MoE)	119 B · 6 B active · moe	—	—	—	—	—	—	—	—	—	71.2%	—	4301
Qwen 3.6 27B (dense)	27 B · dense	—	—	—	24.0%	59.3%	53.5%	77.2%	—	83.9%	87.8%	86.2%	4280
Qwen 3 32B	32 B · dense	—	—	—	—	—	—	—	—	—	65.7%	65.5%	4278
DeepSeek R1 Distill 70B	70 B · dense	—	—	—	—	—	—	—	—	57.5%	65.2%	84.0%	4171
Phi-4 14B	14 B · dense	—	—	—	—	—	—	—	—	—	56.1%	70.4%	4160
Qwen 3.6 35B-A3B (MoE)	35 B · 3 B active · moe	—	—	—	21.4%	51.5%	49.5%	73.4%	—	80.4%	86.0%	85.2%	4084
Qwen 3.5 122B-A10B (MoE)	122 B · 10 B active · moe	—	—	—	25.3%	49.4%	—	72.0%	—	78.9%	86.6%	86.7%	4021
NVIDIA Nemotron 3 Super 120B-A12B (MoE)	120 B · 12 B active · moe	—	—	—	18.3%	31.0%	—	60.5%	—	81.2%	79.2%	83.7%	3839
Gemma 4 12B Unified (dense)	12 B · dense	—	—	—	5.2%	—	—	—	—	72.0%	78.8%	77.2%	3759
Gemma 4 31B (dense)	31 B · dense	—	—	—	19.5%	42.9%	35.7%	52.0%	—	80.0%	84.3%	85.2%	3697
GPT-OSS 120B	120 B · 5 B active · moe	—	—	—	18.5%	18.7%	16.2%	62.4%	—	87.8%	80.9%	90.0%	3573
Mistral Small 3 24B	24 B · dense	—	—	—	—	—	—	—	—	—	45.3%	66.0%	3361
Gemma 3 27B	27 B · dense	—	—	—	—	—	—	—	—	—	42.4%	67.5%	3321
Llama 4 Scout 109B-A17B (MoE)	109 B · 17 B active · moe	—	—	—	—	—	—	—	—	32.8%	57.2%	74.3%	3315
Devstral 2 123B (dense)	123 B · dense	—	—	—	—	32.6%	—	72.2%	—	—	—	—	3174
Gemma 4 26B-A4B (MoE)	26 B · 4 B active · moe	—	—	—	8.7%	34.2%	13.8%	17.4%	—	77.1%	82.3%	82.6%	3060
Qwen 3 Coder 30B-A3B (MoE)	30 B · 3 B active · moe	—	—	—	—	—	—	50.3%	—	—	—	—	3042
Llama 3.3 70B	70 B · dense	—	—	—	—	—	—	—	—	28.8%	50.5%	68.9%	2990
Llama 3.1 8B	8 B · dense	—	—	—	—	—	—	—	—	—	34.6%	49.0%	2527
Qwen 3 8B	8 B · dense	—	—	—	2.8%	—	—	—	—	—	47.0%	65.5%	2325
Gemini 3.1 Pro	Google DeepMind · closed	125 t/s	—	2.1 min	44.7%	80.2%	54.2%	80.6%	—	91.7%	94.3%	91.0%	—
ChatGPT 5.5	OpenAI · closed	61 t/s	—	1.6 min	52.2%	82.0%	58.6%	88.7%	88.0%	—	93.6%	—	—
Claude Sonnet 5	Anthropic · closed	—	—	—	57.4%	80.4%	63.2%	—	—	—	—	—	—
Claude Opus 4.8	Anthropic · closed	59 t/s	—	2.9 min	57.9%	—	69.2%	88.6%	—	—	93.6%	—	—


Last updated 2026-07-11