Best AI models that run on NVIDIA RTX Spark (128 GB)
Coming soon — DGX Spark silicon in Windows laptops and desktops, announced at Computex 2026.
Closed-frontier reference (May 2026)
What "API-grade" models actually score right now. Treat this as the ceiling: anything local will trail it by some margin, and that's fine for most workflows.
Gemini 3.1 Pro
- HLE: 44.7%
- Terminal-Bench 2: 80.2%
- SWE-Bench Pro: 54.2%
- SWE-Bench Verified: 80.6%
- Aider Polyglot: —
- LiveCodeBench: 91.7%
- GPQA Diamond: 94.3%
- MMLU-Pro: 91.0%
Coding: Strong, especially on repo-level work with its 2M context. Less of a Cursor/Cline default than Claude; Gemini Code Assist users prefer it. Tops the leaderboards on GPQA and LiveCodeBench. Now #7 on OpenRouter coding by volume (120B tokens), and Gemini 3.5 Flash is expected to displace it next month.
Agent: TB2 80.2 makes it agent-grade. Used widely on Google's Agent Builder; less common in Open-Claude/Hermes. Reliable in long Vertex-AI Agent runs (multi-hour).
ChatGPT 5.5
- HLE: 52.2%
- Terminal-Bench 2: 82.0%
- SWE-Bench Pro: 58.6%
- SWE-Bench Verified: 88.7%
- Aider Polyglot: 88.0%
- LiveCodeBench: —
- GPQA Diamond: 93.6%
- MMLU-Pro: —
Coding: Codex CLI + GPT-5.5 tops Terminal-Bench. r/ChatGPTCoding has shifted to it for daily coding; Cursor users are mixed (some prefer Claude Sonnet 4.6 for diff quality).
Agent: Strongest published agent score (TB2 82.0%, re-verified 2026-05-14). Widely used in OpenAI Assistants, AutoGPT-style swarms, and Open-Claude routing. Reliable in 4-8h autonomous sessions.
Claude Sonnet 4.6
- HLE: 49.0%
- Terminal-Bench 2: 53.4%
- SWE-Bench Pro: —
- SWE-Bench Verified: 79.6%
- Aider Polyglot: —
- LiveCodeBench: —
- GPQA Diamond: 89.9%
- MMLU-Pro: —
Coding: Cursor / Zed / Cline's default Claude model — better diff quality than GPT-5.5 per popular threads. Loved for its 1M context and lower hallucination rate. The 'work-horse' for daily coding. #6 on OpenRouter coding by token volume (1.77T) this week.
Agent: TB2 53.4 is mid-tier — fine for short-medium loops. Anthropic's official tool-use SDK makes it the most reliable closed model for hand-rolled Hermes / Open-Claude setups.
Claude Opus 4.8
- HLE: 57.9%
- Terminal-Bench 2: —
- SWE-Bench Pro: 69.2%
- SWE-Bench Verified: 88.6%
- Aider Polyglot: —
- LiveCodeBench: —
- GPQA Diamond: 93.6%
- MMLU-Pro: —
Coding: The Cursor / Cline / Zed power-user pick when budget allows. SWE-Pro 69.2 and a self-reported TB2.1 of 74.6 make it the best closed model for real software engineering. Simon Willison describes it as 'a modest but tangible improvement' over Opus 4.7 (simonwillison.net/2026/May/28/claude-opus-4-8/). Reportedly 4x less likely to miss code flaws than its predecessor.
Agent: Top closed agent with SWE-Pro 69.2 and new 'dynamic workflow' tooling (Claude Code 2.1.154). Powers production Hermes / Open-Claude setups. TB2.1 74.6 per Anthropic self-report — TB2.0 leaderboard not yet updated. Fast mode at $10/$50 reduces agentic cost vs standard rate.
Our picks for this build
Sourced from the State of Local AI snapshot — the model + quant + backend we'd actually deploy on this hardware today, with the recipe in the setup guide below.
Qwen 3.6 27B (dense)
Apr 22 2026. Dense 27B that hits 77.2% SWE-Bench Verified — beats much larger MoEs on coding. Vision-capable, 262 K native context. Best single-24 GB-card coder right now.
- HLE: 24.0%
- TB2: 59.3%
- SWE-Pro: 53.5%
- SWE-Ver: 77.2%
Coding: The new local-coding king under 200B on r/LocalLLaMA — matches Claude Opus 4.5 on TB2 per Qwen's launch claims, beats Qwen3.5-397B-A17B on every coding eval. Daily-driver pick for Cline at Q4_K_M on a single Pro 6000 or M3 Ultra. Confirmed running ~160 tok/s with MTP on RTX 6000 per dzombak.com vLLM recipe.
Agent: Genuinely useful in Open-Claude / Claude Code routing; community reports describe 30-min+ sessions completing without derailing. Still trails the closed frontier on the very longest loops. Its editorial agent rating is capped at 3 per site rule (the model is sub-200B and its TB2 of 59.3 is below the 65% threshold).
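The cap mentioned above can be read as a simple guard on the editorial agent rating. The sketch below is one possible reading of that rule, not the site's actual code: the function name, the constants' names, and the choice to treat a missing TB2 score as "below threshold" are all assumptions made for illustration.

```python
from typing import Optional

SIZE_LIMIT_B = 200.0   # "sub-200B" boundary quoted in the text
TB2_THRESHOLD = 65.0   # TB2 percentage quoted in the text
AGENT_CAP = 3          # maximum agent rating when the rule applies


def cap_agent_rating(raw_rating: int, params_b: float, tb2_pct: Optional[float]) -> int:
    """Return the agent rating after applying the cap.

    Assumption: a model with no published TB2 score is treated as
    below the threshold. The source does not say how that case is handled.
    """
    below_threshold = tb2_pct is None or tb2_pct < TB2_THRESHOLD
    if params_b < SIZE_LIMIT_B and below_threshold:
        return min(raw_rating, AGENT_CAP)
    return raw_rating


# Qwen 3.6 27B: 27 B parameters, TB2 59.3, so a raw 5 would be capped to 3.
print(cap_agent_rating(5, 27.0, 59.3))
```

Keeping the thresholds as named constants makes it easy to see which part of the rule would change if the site revised the 65% bar.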
Qwen 3.6 35B-A3B (MoE)
Apr 2026 release. 35B / 3B active MoE that beats Gemma 4-31B on agentic coding and matches Sonnet on most vision tasks. Native 262 K context (extensible to 1 M), ~18 GB at Q4.
- HLE: 21.4%
- TB2: 51.5%
- SWE-Pro: 49.5%
- SWE-Ver: 73.4%
Coding: r/LocalLLaMA's pick for fast local coding on a 24 GB card at Q4_K_M; with only 3B active it's snappy. Vibe-codes 'perfectly fine' in OpenCode/Claude Code per multiple weekly megathreads. Simon Willison's pelican test (April 2026), 'Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7', is still resonating in the community.
Agent: Solid in 5-15 tool-hop loops in Cline. Long-horizon (60+ min) Open-Claude sessions still lose the thread; 3B active is a ceiling on planning. Note: Qwen self-reports TB2 51.5 while community runs land at 23-24%. The gap is harness-driven (Terminus-2 vs little-coder agent).
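The "~18 GB at Q4" figure quoted for this model is close to what plain arithmetic gives: 35 B parameters at 4 bits each. The sketch below shows that back-of-envelope estimate and a rough fit check against a 128 GB machine. It is an approximation only: the `headroom_gb` reserve for KV cache and runtime is a hypothetical value, and real quant formats such as Q4_K_M carry per-block overhead, so actual files run somewhat larger.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float = 128.0, headroom_gb: float = 16.0) -> bool:
    """Rough fit check. headroom_gb is an assumed reserve for KV cache,
    activations, and the OS; it is not a figure from the source."""
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


print(weight_gb(35, 4))   # 35 B params at 4 bits: 17.5 GB of weights
print(fits(128, 4))       # 128 B dense at 4 bits: 64 GB of weights plus headroom
print(fits(128, 8))       # the same model at 8 bits needs 128 GB for weights alone
```

Long contexts grow the KV cache well beyond any fixed reserve, so treat the result as a first filter rather than a guarantee.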
Mistral Medium 3.5 128B
Apr 30 2026. Western 128B dense with vision + 256 K context. 77.6% SWE-Bench Verified; first credible mid-tier open-weight from Mistral in months. Modified MIT.
- SWE-Ver: 77.6%
Coding: Launched with a built-in PR-opening coding agent. Early r/LocalLLaMA reports treat it as a credible Cline driver, though it trails Qwen 3.6-27B on real refactors.
Agent: Mistral's agent SDK is OK; in Open-Claude it handles ~20-min sessions reliably. Long-horizon ceiling still unclear pending community evals.
Every open-weights model that fits, ranked by composite score
Composite blends benchmark averages (60%) with editorial 0-5 ratings (40%). Closed-frontier references appear in the same table for comparison and carry no composite score.
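A 60/40 blend of this kind could be computed roughly as follows. This is a sketch under stated assumptions, not the site's formula: the source does not say how missing benchmarks are handled, how ratings are normalized, or how the result is scaled to the four-digit scores in the table. Here missing benchmarks are skipped, both inputs are normalized to 0-1, and the output stays on a 0-1 scale. The function name and the editorial ratings in the example are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence

BENCH_WEIGHT = 0.60      # weight on the benchmark average, from the text
EDITORIAL_WEIGHT = 0.40  # weight on the editorial ratings, from the text


def composite(bench_pcts: Sequence[Optional[float]],
              editorial_ratings: Sequence[float]) -> float:
    """Blend benchmark percentages (0-100) with editorial ratings (0-5).

    Assumption: benchmarks with no published score (None) are ignored
    rather than counted as zero.
    """
    scored = [p for p in bench_pcts if p is not None]
    if not scored or not editorial_ratings:
        raise ValueError("need at least one benchmark score and one rating")
    bench = mean(scored) / 100.0              # normalize to 0-1
    editorial = mean(editorial_ratings) / 5.0  # normalize to 0-1
    return BENCH_WEIGHT * bench + EDITORIAL_WEIGHT * editorial


# Benchmark values are the Qwen 3 Next 80B-A3B row (LCB, GPQA, MMLU-Pro);
# the editorial ratings [4, 3] are made up for the example.
print(round(composite([None, 78.4, 77.2, 82.7], [4, 3]), 4))
```

Skipping missing scores means a model with few published benchmarks is judged only on those it reports, which may explain how sparsely benchmarked models can rank highly.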
| Model | tg/s | pp | TTFT @ 100K | HLE | TB2 | SWE-Pro | SWE-Ver | Aider | LCB | GPQA | MMLU-Pro | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen 3 Next 80B-A3B (MoE) · 80 B · 3 B active · moe | — | — | — | — | — | — | — | — | 78.4% | 77.2% | 82.7% | 4920 |
| Mistral Medium 3.5 128B · 128 B · dense | — | — | — | — | — | — | 77.6% | — | — | — | — | 4689 |
| GLM-4.5-Air 106B (MoE) · 106 B · 12 B active · moe | — | — | — | — | — | — | 57.6% | — | — | — | — | 4488 |
| DiffusionGemma 26B-A4B · 26 B · 4 B active · diffusion-moe | — | — | — | — | — | — | — | — | 69.1% | 73.2% | 77.6% | 4358 |
| Mistral Small 4 119B-A6B (MoE) · 119 B · 6 B active · moe | — | — | — | — | — | — | — | — | — | 71.2% | — | 4301 |
| Qwen 3.6 27B (dense) · 27 B · dense | — | — | — | 24.0% | 59.3% | 53.5% | 77.2% | — | 83.9% | 87.8% | 86.2% | 4280 |
| Qwen 3 32B · 32 B · dense | — | — | — | — | — | — | — | — | — | 65.7% | 65.5% | 4278 |
| DeepSeek R1 Distill 70B · 70 B · dense | — | — | — | — | — | — | — | — | 57.5% | 65.2% | 84.0% | 4171 |
| | — | — | — | — | — | — | — | — | — | 56.1% | 70.4% | 4160 |
| Qwen 3.6 35B-A3B (MoE) · 35 B · 3 B active · moe | — | — | — | 21.4% | 51.5% | 49.5% | 73.4% | — | 80.4% | 86.0% | 85.2% | 4084 |
| Qwen 3.5 122B-A10B (MoE) · 122 B · 10 B active · moe | — | — | — | 25.3% | 49.4% | — | 72.0% | — | 78.9% | 86.6% | 86.7% | 4021 |
| NVIDIA Nemotron 3 Super 120B-A12B (MoE) · 120 B · 12 B active · moe | — | — | — | 18.3% | 31.0% | — | 60.5% | — | 81.2% | 79.2% | 83.7% | 3839 |
| Gemma 4 31B (dense) · 31 B · dense | — | — | — | 19.5% | 42.9% | 35.7% | 52.0% | — | 80.0% | 84.3% | 85.2% | 3697 |
| GPT-OSS 120B · 120 B · 5 B active · moe | — | — | — | 18.5% | 18.7% | 16.2% | 62.4% | — | 87.8% | 80.9% | 90.0% | 3573 |
| Mistral Small 3 24B · 24 B · dense | — | — | — | — | — | — | — | — | — | 45.3% | 66.0% | 3361 |
| Gemma 3 27B · 27 B · dense | — | — | — | — | — | — | — | — | — | 42.4% | 67.5% | 3321 |
| Llama 4 Scout 109B-A17B (MoE) · 109 B · 17 B active · moe | — | — | — | — | — | — | — | — | 32.8% | 57.2% | 74.3% | 3315 |
| Devstral 2 123B (dense) · 123 B · dense | — | — | — | — | 32.6% | — | 72.2% | — | — | — | — | 3174 |
| Gemma 4 26B-A4B (MoE) · 26 B · 4 B active · moe | — | — | — | 8.7% | 34.2% | 13.8% | 17.4% | — | 77.1% | 82.3% | 82.6% | 3060 |
| Qwen 3 Coder 30B-A3B (MoE) · 30 B · 3 B active · moe | — | — | — | — | — | — | 50.3% | — | — | — | — | 3042 |
| Llama 3.3 70B · 70 B · dense | — | — | — | — | — | — | — | — | 28.8% | 50.5% | 68.9% | 2990 |
| Llama 3.1 8B · 8 B · dense | — | — | — | — | — | — | — | — | — | 34.6% | 49.0% | 2527 |
| | — | — | — | 2.8% | — | — | — | — | — | 47.0% | 65.5% | 2325 |
| Gemini 3.1 Pro · Google DeepMind · closed | 125 t/s | — | 2.1 min | 44.7% | 80.2% | 54.2% | 80.6% | — | 91.7% | 94.3% | 91.0% | — |
| ChatGPT 5.5 · OpenAI · closed | 61 t/s | — | 1.6 min | 52.2% | 82.0% | 58.6% | 88.7% | 88.0% | — | 93.6% | — | — |
| Claude Sonnet 4.6 · Anthropic · closed | 45 t/s | — | 3.0 s | 49.0% | 53.4% | — | 79.6% | — | — | 89.9% | — | — |
| Claude Opus 4.8 · Anthropic · closed | 58 t/s | — | 2.9 min | 57.9% | — | 69.2% | 88.6% | — | — | 93.6% | — | — |
Last updated 2026-06-13