What hardware do I need to run AI locally?

Memory is the number that decides what you can run. 8 to 16 GB of RAM runs a small 8B model for casual chat at no cost on hardware you already own. A serious coding assistant wants 20 to 48 GB (a used RTX 3090, an RTX 5090, or a MacBook M5 Pro). Agents, long documents and vision want 60 to 192 GB (a 128 GB Strix Halo, a DGX Spark, or an M5 Max). Frontier-class open models need 192 GB and up, which means clusters.

How do I start running AI locally for free?

Install one app (LM Studio for a point-and-click window, or Ollama for a terminal), download one model that fits your memory (Qwen3 8B at Q4_K_M on 8 to 16 GB, Qwen 3.6 27B at Q4_K_M on a 24 GB GPU), load it and type a prompt. The whole setup takes about ten minutes on hardware you already own.

What does quantization mean and which one should I use?

Quantization stores a model's weights with fewer bits so it fits smaller hardware. Q4_K_M (4-bit) is what most people run at 92 to 95 percent of full quality; Q5_K_M (5-bit) is the default sweet spot at 96 to 98 percent; Q8_0 (8-bit) is effectively indistinguishable from full precision if you have the room. The rule: Q4_K_M when memory is tight, Q5_K_M as the default, Q8_0 when you have room to spare.

All guides

How to Run AI Locally: The Complete Beginner's Guide (2026)

Running AI on your own machine stopped being a hobbyist stunt about two years ago. In 2026 a model you download for free writes code at a level that needed a $20-per-month subscription not long ago, and the hardware to run it ranges from the laptop you already own to a small cluster in a closet. The catch is that the field is full of jargon (GGUF, quantization, MoE, ROCm) and the first hour is where most people give up.

Underneath the jargon, it comes down to three things: the hardware to run it on, a runner to load and serve the model, and a harness to put it to work. Here is each one — what it is, your options, and the step to take — then the rest of the guide goes deep on all three. Every number comes from our own dataset (last updated 2026-06-13) or a linked source.

Already comfortable with local AI? Read our State of Local AI snapshot instead — the best models and hardware at each budget right now.

The models

A model is the AI itself — the part that actually writes the answer. ChatGPT, Claude and Gemini are products built on closed models you rent over the internet. Open models — Qwen, Llama, Gemma, DeepSeek — are the ones you download and run yourself, for free, with nothing leaving your machine. They are not as capable as the latest closed frontier, and usually smaller, but they handle everyday questions, writing and a lot of coding without trouble. Each open model has its own hardware requirements, though — which is where we start.

The hardware

The hardware is the machine the model runs on, and memory is the number that decides what you can load. Most of the work falls to the GPU (graphics processing unit) — the same chip that draws video games, whose thousands of small cores happen to be exactly what the math behind AI needs, so it runs a model many times faster than an ordinary processor. A consumer GPU (RTX 3090, 4090, 5090) is the fastest path; a Mac's unified memory or a mini-PC like an AMD Strix Halo or NVIDIA DGX Spark can hold a much bigger model but runs it slower; clusters are only for the frontier. Start by checking what you already have.

Step 1. Check your memory. On Windows, Task Manager > Performance shows your RAM and your GPU's "dedicated memory" (VRAM). On a Mac, Apple menu > About This Mac. Memory is the one number that decides what you can run.

The runner

The runner is the app that loads a model file and serves it, usually behind an OpenAI-compatible endpoint everything else can talk to. Two are beginner-friendly — LM Studio (a window you click) and Ollama (a terminal) — and both wrap the same engine, llama.cpp; vLLM is the step up when several people share one box. You also pick a quantization, a compressed copy of the model that makes it fit; pick Q4_K_M or Q5_K_M to start, and the Quantization section below explains what you are trading.

Step 2. Install one app. Download LM Studio (free, Windows/Mac/Linux). It is a regular desktop app with a built-in model browser that warns you before you download something too big for your machine. If you live in a terminal, install Ollama instead. Do not compare five tools this week. Most of them are wrappers around the same engine (llama.cpp), so you lose nothing by starting with one.

Step 3. Download one model. In LM Studio's search tab, pick by what your machine has:

Your machine	Model to download	Why
8 to 16 GB RAM, no real GPU	Qwen3 8B, Q4_K_M	Small enough to fit, smart enough to be useful
16 to 24 GB VRAM GPU (RTX 3090, 4090)	Qwen 3.6 27B, Q4_K_M	The best model that fits one consumer card
Mac with 32 to 48 GB unified memory	Qwen 3.6 27B, Q4_K_M	Same model, Apple Silicon runs it well
Mac or PC with 64 GB+	Qwen 3.6 35B-A3B, Q5_K_M	Faster responses, more headroom

Step 4. Load it and type a prompt. That is the whole setup. If the model crashes on load or answers in word salad, jump to the troubleshooting section; both failures have two-minute fixes.

If you want the model picked for your exact machine instead of a table of four rows, our model picker matches every current open model against your memory and shows expected speed.

The harness

Give it tools with an agent harness. Chatting is the demo; the real upgrade is letting your local model do work — read files, run commands, and work toward a goal in a loop instead of answering one message at a time. A harness connects to the OpenAI-compatible endpoint your runner already serves, so it is the same model from the runner step, now pointed at real work instead of a chat box. You may already know the harnesses from the big labs — Claude Code and Cowork, Codex, GitHub Copilot. The open-source world does not stay behind: OpenClaw or the Hermes agent for a personal assistant, opencode or pi for coding. And because it is open, anyone can take part: YouTuber PewDiePie recently shipped his own harness, Odysseus, and actress Milla Jovovich co-launched MemPalace, a memory layer that gives these harnesses long-term recall.

Step 5. Attach a harness. Install one and point it at the local endpoint your runner from steps 2 to 4 already serves — that is the whole integration, no extra glue. The catch is the model, not the plumbing:

But agents need a capable model. Driving a tool-using loop is much harder than answering one question, and the small 8B that is fine for chat will stall on it. Aim for at least Qwen 3.6 27B (dense) or Qwen 3.6 35B-A3B (MoE), and judge candidates by their Terminal-Bench 2.0 score — the benchmark that measures whether a model can actually drive an agent, not chat polish. That lifts the entry bar: useful agentic local AI starts around the serious-coding tier (a 24 GB GPU, or a 32 GB+ unified-memory machine), not the 8 to 16 GB you can comfortably chat on.

Use it from your phone. The personal-assistant harnesses above connect to the messaging apps you already use — WhatsApp, Telegram, Discord, Signal — so your home machine becomes just another contact. Text it from the train and it answers, and everyone in the household reaches it the same way, with no new app to install.

Match the tier to your use case

How much to spend comes down to what you will actually do with the model, and whether it serves just you or the whole household. Two shapes cover almost everyone:

The personal assistant. You sit at your desk, you want a code helper or a writing partner, and the machine only needs to serve you. This is laptop or desktop territory: an RTX 5090 in a tower, or a MacBook Pro with an M5 Pro or M5 Max. When the lid closes, the AI sleeps, and that is fine.

The household server. You want the assistant available from your phone on the train, from your partner's laptop, from the kids' homework machine. Now the model has to live on a separate always-on box: a used GPU rig, an AMD Strix Halo mini PC, or an NVIDIA DGX Spark humming in a closet. Different purchase, different power bill, and different software — you will care about multi-user serving (vLLM) and remote access (the harness above).

Here is the full ladder, with minimum and comfortable memory from our dataset:

What you want to do	Memory needed (min / comfy)	Typical hardware	Typical cost
Casual chat, summaries, drafts	8 / 16 GB	The PC or Mac you already own	$0
Serious coding assistant	20 / 48 GB	Used RTX 3090, RTX 5090, MacBook M5 Pro 48 GB	$1,750 to $4,500
Agents, long documents, vision	60 / 192 GB	Strix Halo 128 GB, DGX Spark, M5 Max 128 GB	$2,800 to $5,400
Frontier-class open models	192 GB and up	Spark clusters, Mac Studio M3 Ultra, RTX Pro 6000 rigs	$8,000 to $80,000

One honest warning before you spend anything: laptops do not run 70B-class dense models, no matter what the spec sheet implies. Thermals and memory bandwidth throttle them long before the marketing does. And the trendy 128 GB "AI boxes" have a real weakness the Dense models section below makes clear. Capacity and speed are different axes, and beginners get sold capacity.

Dense models: the entry point

A dense model uses all of its parameters for every word it generates. The size of the model is the size of the bill: a 27B dense model reads all 27 billion weights for each token, so memory capacity decides if it fits and memory bandwidth decides how fast it talks.

The good news for 2026: dense models top out, in practical terms, around 27B, and that size fits consumer hardware. Qwen 3.6 27B is the model that defines this tier. It scores 77.2 percent on SWE-bench Verified, which beats every open-weights mixture-of-experts model that fits in 128 GB or less, ships under Apache 2.0, accepts images, and carries a 262K-token context window. At Q4 it is about 16 GB of weights: 20 GB of memory to start, 24 GB to be comfortable.

That is why the used RTX 3090 keeps winning value arguments. Its 24 GB runs the best sub-frontier coding model available, on the most mature software stack there is, for $800 to $1,130 on the used market (around $1,050 average in late May 2026). A whole build lands near $1,750. Our dataset has it decoding 27B-class models at about 28 tokens per second, which is faster than you read.

How fast the same model runs elsewhere, from our benchmark data:

Build	Price (build)	27B-class decode	Prompt reading speed
RTX Pro 6000 Blackwell 96 GB Amazon ↗	~$12,200	85 tok/s	4,100 tok/s
RTX 5090 32 GB Amazon ↗	~$4,500	70 tok/s	3,900 tok/s
Mac Studio M3 Ultra 512 GB Apple ↗	~$14,199	38 tok/s	280 tok/s
DGX Spark 128 GB Amazon ↗	~$4,699	30 tok/s	700 tok/s
MacBook Pro M5 Max 64 GB Amazon ↗	~$4,099	30 tok/s	290 tok/s
RTX 3090 24 GB (used) Amazon (used) ↗	~$1,750	28 tok/s	1,400 tok/s
MacBook Pro M5 Pro 48 GB Amazon ↗	~$3,199	18 tok/s	145 tok/s
AMD Strix Halo 128 GB Amazon ↗	~$2,799	16 tok/s	240 tok/s

For dense models, the cheaper card wins on speed: the $1,750 used build outruns the $2,799 AI mini PC at this tier, and the $4,699 Spark only ties a MacBook. Dense models reward memory bandwidth, and consumer NVIDIA cards have it.

MoE models: more brain, same memory bill per word

A mixture-of-experts (MoE) model stores many specialist sub-networks but activates only a few per token. Qwen 3.6 35B-A3B holds 35 billion parameters but reads just 3 billion per word. Storage cost: 18 GB at Q4, barely more than the dense 27B. Speed cost per token: a fraction of it. That is why MoE models feel snappy on bandwidth-starved hardware like the Strix Halo or base Macs.

The current ladder, all sizes at Q4 from our dataset:

Model	Total / active params	Download size	Min memory	What it gets you
Qwen 3.6 35B-A3B	35B / 3B	18 GB	22 GB	Near-27B quality, much faster on weak bandwidth
Qwen 3.5 122B-A10B	122B / 10B	67 GB	80 GB	First open Qwen with native vision, 262K context
DeepSeek V4-Flash	284B / 13B	160 GB	192 GB	Agent-grade, 1M-token context

The surprising 2026 result: the new-generation 35B-A3B beats the previous-generation 122B on coding and agent benchmarks while using a quarter of the memory (the bigger model keeps small leads on broad-knowledge tests). Generations beat size. Before you buy hardware to run a bigger model, check whether a newer small one already does the job; our State of Local AI page tracks exactly this.

The 122B tier is where the 96 to 128 GB machines earn their keep:

Build	Price	122B-class decode	Prompt reading
MacBook Pro M5 Max 128 GB Amazon ↗	$5,399	65 tok/s	1,325 tok/s
DGX Spark 128 GB Amazon ↗	$4,699	57 tok/s	1,150 tok/s
AMD Strix Halo 128 GB Amazon ↗	$2,799	47 tok/s	380 tok/s
RTX Pro 6000 Blackwell 96 GB Amazon ↗	~$12,200	190 tok/s	5,800 tok/s

The pattern flips from the dense table: on a 10B-active MoE, the unified-memory boxes suddenly look great at decode. This is the rule that saves the most money: big-memory boxes are MoE machines. If you buy one, plan to run MoE models on it.

The high end: clustering territory

Above 192 GB you are chasing the open-weights frontier, and the price curve goes vertical. The three models that matter as of June 2026, in ascending order of both quality and pain:

DeepSeek V4-Flash (284B, 160 GB at Q4, 192 GB min). The affordable end of frontier: it fits a 2x DGX Spark cluster ($9,500, about 24 tok/s on this class of model) or a Mac Studio M3 Ultra 256 GB ($7,999, about 22 tok/s).

NVIDIA Nemotron 3 Ultra (550B total, 55B active, 300 GB at Q4, 340 GB min). Released June 4, 2026, currently the strongest US open-weights model, with native 1M context. You need a 4x Spark cluster ($19,500) or a quad RTX Pro 6000 workstation ($38,000).

Kimi K2.6 (1T parameters, 540 GB at Q4, 600 GB min). The top open-weights agent model on Terminal-Bench 2.0 (66.7 percent). Homelab paths: 8x DGX Spark ($43,500, about 13 tok/s) or two used Mac Studio M3 Ultra 512 GB linked together (around $28,400). Yes, those are car prices.

Clustering also brings real engineering: 2x the GPUs gets you roughly 1.5x the throughput, not 2x, and mixed cards bottleneck on the slowest one. A multi-GPU tower wants a 1,200 W+ power supply and full PCIe lanes per card. Buy a cluster to fit a model that cannot fit otherwise, never for speed alone.

Unless you specifically need the frontier, this tier is not for you. A $2,000 to $5,000 machine running Qwen 3.6 27B or 35B-A3B covers code help, writing, documents, and vision for one person with money left over. The build picker will tell you the cheapest build that fits whatever model you have in mind.

Quantization: how a 54 GB model fits in 16 GB

Models are trained in 16-bit precision (fp16), two bytes per parameter. Qwen 3.6 27B at fp16 would need roughly 54 GB. Quantization stores those weights with fewer bits, and it is the reason consumer hardware can play at all. The formats you will actually meet:

Format	Bits	27B model size	When to use it
fp16	16	~54 GB	Reference quality; you will almost never run this at home
fp8	8	~27 GB	Serving stacks (vLLM) on modern GPUs; near-lossless
nvfp4	4	~14 GB	NVIDIA's 4-bit format; fast path on Blackwell cards (5090, Spark, Pro 6000)
Q8_0	8	~29 GB	GGUF 8-bit; effectively indistinguishable from fp16
Q5_K_M	5	~20 GB	GGUF 5-bit; the default sweet spot, 96 to 98 percent of full quality
Q4_K_M	4	~16 GB	GGUF 4-bit; what most people run, 92 to 95 percent of quality

About those suffixes: GGUF is the file format used by llama.cpp, LM Studio, and Ollama, and the K-quant suffixes (_K_M, _K_S) mark newer mixed-precision schemes that keep the most sensitive layers at higher precision. The _M variant is "medium" and the one you want; _S saves a little more space for a little more damage.

The decision rule, and you can stop thinking after applying it: Q4_K_M when memory is tight, Q5_K_M as the default, Q8_0 when you have room to spare. Below Q4 the quality drop gets noticeable fast; a Q2 of a big model is usually worse than a Q5 of a smaller one. Our picker defaults to Q4_K_M sizes for exactly this reason.

Quantization is also your bandwidth fix: a Q4_K_M download is about 70 percent smaller than fp16, which matters when models run 16 to 540 GB and your internet connection is the bottleneck.

Backends: the layer that decides your pain level

The backend is the GPU compute layer your runner is built on. You mostly do not choose it directly; you choose hardware, and the backend experience comes bundled.

CUDA (NVIDIA). The reference. Every tool supports it first, day one, with the fewest bugs. The entire reason "just buy NVIDIA" is the default advice is this column, not the silicon.

MLX (Apple). Apple's own framework for M-series chips, and the second-best supported path in practice. LM Studio uses it natively; performance on M5-generation Macs is excellent for the watts.

ROCm (AMD). Improving fast on Linux, where it is now a credible stack for the cards we list. On Windows it remains the friction path: support effectively means WSL2 or selected apps, and community guides still open with workarounds. If you run AMD on Windows, use the Vulkan backend instead.

Vulkan (cross-vendor). The universal fallback that runs on nearly anything, including AMD on Windows and Intel. Often within striking distance of native stacks, sometimes ahead on consumer AMD. LM Studio ships it built in.

Intel (SYCL / IPEX / OpenVINO). The budget VRAM play (the Arc Pro B70 is the cheapest new 32 GB card at $1,800) with the youngest software. It runs the big models via Vulkan and OpenVINO's llama.cpp backend, but expect setup time the CUDA crowd never sees.

The honest summary for a first build: NVIDIA if you want zero friction, Apple if you live in the Mac world, AMD or Intel if VRAM per dollar excites you more than your weekend does.

The apps: pick one of two, graduate later

Five names dominate every comparison thread, and the secret is that most of them are the same engine wearing different clothes. Ollama, LM Studio, Jan, and friends all run llama.cpp (plus MLX on Macs) under the hood.

Start with LM Studio if you want a point-and-click app: model browser with fit warnings, chat window, one-click local server. Start with Ollama if you want a command-line tool that scripts cleanly and exposes an OpenAI-compatible API by default. There is no wrong pick; both take minutes.

Graduate to llama.cpp itself when you want every last token per second and flags worth tuning (it is what the wrappers run anyway, just with the dials exposed).

Graduate to vLLM when more than one person hits the box at once. It is a serving engine: continuous batching, much higher total throughput, the standard choice for the household-server setup. It wants real VRAM and prefers fp8/AWQ-style quants over GGUF.

Performance recipes: free speed, in order of effort

A recipe is the launch setup for one model on one runner. It is the exact parameters for running the model you want on the backend or runner you are using: which build or docker image to use (a community-tuned image or the official one), the MoE-specific flags, the context length, and the KV-cache settings. Get them right and you unlock the speed below; get them wrong and the model crawls or will not load. As a beginner, copy a proven recipe rather than experiment — for your exact model-and-hardware pairing, search the web (popular setups usually have one posted), and we cover the most popular pairings ourselves: the State of Local AI data ships a tested launch command per build, flags included.

The spread between a stock setup and a tuned one on identical hardware is large. Published Qwen 3.6 27B numbers on a used RTX 3090 run from 72 tokens per second with a one-click build to 85 sustained while holding 125K of context, and on an RTX 5090 a tuned server reports 158; the toggles below are most of that gap.

Offload all layers to the GPU. The single biggest llama.cpp gotcha: if you see -ngl at its default in some setups, part of the model runs on CPU. Set it to cover every layer (99 works as "all"). LM Studio exposes this as a GPU offload slider.

Turn on Flash Attention and quantize the KV cache. Two flags in llama.cpp (-fa, plus --cache-type-k q8_0 --cache-type-v q8_0) that cut the memory your context consumes roughly in half with negligible quality cost. This is often the difference between "32K context fits" and "out of memory."

Right-size your context. Context (the model's working memory) costs VRAM per token whether you use it or not. Running 8K when you chat casually and 32K when you code is smarter than maxing it because the slider goes there.

MoE CPU offload, the budget cheat code. llama.cpp can keep a MoE model's always-active layers on the GPU and park the rarely-used experts in system RAM. A 24 GB card plus 64 GB of system RAM can run 80 to 120B-class MoE models at usable speeds this way. If you own a 3090 and 64 GB of RAM, try this before buying anything.

Speculative decoding and MTP. Newer models (Nemotron 3, DeepSeek V4) ship multi-token prediction heads that generate several tokens per step; backends increasingly support them, and they account for the top end of community benchmark numbers. Free real speed, when your stack supports the model's MTP config.

Use the right chat template. Not actually a speed trick, despite what older guides say. The template is the formatting wrapper that turns your chat into the token stream the model was trained on, and the modern fix is passing --jinja in llama.cpp so the model's own embedded template gets used. What a wrong template actually causes is covered next, because it is the most misdiagnosed failure in local AI.

When it breaks: the three errors everyone hits

Three failures account for most "local AI is broken" posts. All three look fatal and none of them are.

"CUDA out of memory" (or the app just crashes on load)

The most common first-run error there is. The model plus its context does not fit your VRAM, usually because the context buffer pushed it over. Nothing is broken. Fix in this order:

Lower the context size (in Ollama: /set parameter num_ctx 2048; in LM Studio it is the context slider). Try again.
Still failing: download one quant level lower (Q4_K_M instead of Q5_K_M).
On llama.cpp: quantize the KV cache (--cache-type-k q8_0 --cache-type-v q8_0).
Last resort: partial CPU offload (lower -ngl a few layers below "all"), and accept the speed cost.

A rule that prevents the error entirely: weights should use at most about 80 percent of your VRAM, because context needs the rest.

Gibberish output (word salad, endless repetition, wrong language)

Looks like a broken model; almost never is. Two causes cover nearly every case. Either you downloaded the base model instead of the instruct one (base models complete text, they do not answer questions; always pick the file with "instruct" or "chat" in the name), or the chat template is wrong, meaning the model is receiving your conversation in a format it was never trained on. The fix: use the instruct build, let the runner use the model's own template (--jinja in llama.cpp; LM Studio and Ollama do this automatically for known models), and if it persists, delete and re-download the model, since a corrupted or mislabeled download produces exactly this.

"It forgot what we were talking about"

Not a bug, a silent truncation. Runners default to small context windows (Ollama historically 2K to 4K tokens), and when your conversation exceeds it, the oldest messages quietly fall off. The model is not dumb; it literally never saw them. Raise the context (num_ctx in Ollama, the slider in LM Studio), but mind the trade: context eats VRAM, and if it overflows into system RAM your speed collapses by an order of magnitude. The KV-cache quantization flags from the Performance recipes section buy you roughly double the context for the same memory.

Buying traps and the real cost of ownership

The mistakes that quietly cost the most money:

The 8 GB GPU trap. An 8 GB card technically runs small models and immediately suffocates: after the weights there is no room for context. 12 GB is the usable floor, 16 GB is a real starting point, 24 GB is where the good models live. If the budget only reaches an 8 GB card, skip the GPU and run a small MoE on your CPU and RAM instead; you lose less than you think.

VRAM is the limit, bandwidth is the speed, cores are marketing. LLM inference reads the whole active model per token, so memory size decides what fits and memory bandwidth decides tokens per second. Core counts and boost clocks barely move the needle. This is also the "capacity is not speed" rule from the Dense models section: a 128 GB box with 273 GB/s of bandwidth holds big models and runs dense ones slowly.

System RAM and storage are not afterthoughts. 32 GB system RAM minimum, 64 GB if you want the MoE-offload trick from the Performance recipes section. Models are huge: 16 GB for a 27B at Q4, 67 GB for the 122B, 540 GB for Kimi K2.6. A 1 to 2 TB NVMe drive is part of the budget, and on a slow connection the download itself is your first bottleneck.

Heat, noise, and the power bill. A used 3090 pulls about 350 W under load; at $0.15 per kWh and 8 hours a day that is roughly $13 a month. An RTX 5090 build under the same load is about $19 a month, a DGX Spark closer to $9 to $12 with its 24/7 idle included, and an M5 MacBook a few dollars. The quad-GPU rigs from the high-end tier draw over 2,000 W, which is a dedicated circuit in most homes. None of this is ruinous, all of it belongs in the math.

Local versus cloud, with a straight face. If you spend under about $200 a month on AI APIs, the cloud is usually cheaper than amortizing a new rig; above it, or with privacy or always-on workloads in the mix, local wins on math. Run your own numbers in our picker's 5-year cost-of-ownership column before deciding; that column exists because nobody does this calculation soberly while they are excited.

Upgrade paths matter at purchase time. A used 3090 build leaves a clean path: add a second 3090 later for 48 GB total (about $3,100 all-in for a dual build). A Mac's memory is fixed forever at checkout, so buy the configuration you will want in two years, not the one that fits today. Sparks officially cluster over their built-in interconnect, which is the rare upgrade path that actually works as advertised.

The bottom line

Run the free thing you already own first. A 16 GB laptop running an 8B model costs nothing and teaches you what you actually want from a bigger machine. When you outgrow it, the 2026 value ladder is short: a used RTX 3090 at about $1,750 all-in for the best coding model per dollar, a 128 GB MoE box (Strix Halo at $2,799 if price wins, DGX Spark or an M5 Max MacBook if polish does) for the 122B tier, and clusters only when a specific frontier model demands one.

Whatever tier you land on, do not buy from a spec sheet. Tell our build picker what you want to run and it returns the cheapest builds that fit, with measured speeds, power draw, and current prices, or start from the model side and see what your hardware can already do. Real benchmarks, not vibes.

Sources

LLMRequirements hardware dataset (builds, prices, tok/s, power; updated 2026-06-13) and State of Local AI (internal).
Qwen 3.6 27B model card (Hugging Face): huggingface.co/Qwen/Qwen3.6-27B
Qwen 3.6 35B-A3B announcement and model card: qwen.ai/blog and huggingface.co/Qwen/Qwen3.6-35B-A3B
Qwen 3.5 122B-A10B on a single DGX Spark at up to 51 tok/s (NVIDIA Developer Forums): forums.developer.nvidia.com
NVIDIA Nemotron 3 Ultra 550B release, June 4, 2026 (NVIDIA Technical Blog): developer.nvidia.com
Terminal-Bench 2.0 leaderboard (Kimi K2.6 agent scores): tbench.ai/leaderboard
Kimi K2 family local-run sizes (Unsloth docs): unsloth.ai
Used RTX 3090 and RTX 5090 price history (Best Value GPU tracker): bestvaluegpu.com
DGX Spark MSRP raised to $4,699 (NVIDIA Developer Forums, Feb 2026): forums.developer.nvidia.com
Apple M5 Pro / M5 Max MacBook Pro launch (Apple Newsroom, March 2026): apple.com/newsroom
Prompt processing as the unified-memory weak spot, DGX Spark vs Strix Halo (The Register): theregister.com
Ollama vs LM Studio for beginners (CORSAIR): corsair.com
Most local runners are llama.cpp wrappers (Codersera, 2026 comparison): codersera.com
Qwen 3.6 27B one-click server, 158 tok/s on a 5090 / 72 tok/s on a 3090 (devnen, GitHub): github.com/devnen
Qwen 3.6 27B at 85 tok/s sustained with 125K context on one RTX 3090 (community write-up): medium.com
CUDA out-of-memory causes and fixes (Markaicode): markaicode.com
Gibberish output debugging, base vs instruct and chat templates (InsiderLLM): insiderllm.com
Context window misunderstanding, num_ctx (Ollama GitHub issue #2714): github.com/ollama
GGUF quantization levels compared, Q4_K_M / Q5_K_M / Q8_0 (Mustafa.net): mustafa.net
ROCm on Windows status and the Vulkan fallback (RunAIHome): runaihome.com
Local vs cloud break-even analysis (SitePoint, 2026): sitepoint.com
Agent harnesses that run on a local OpenAI-compatible endpoint: OpenClaw, Hermes Agent (Nous Research), opencode, pi
Community-built harnesses: PewDiePie's self-hosted workspace Odysseus (github.com/pewdiepie-archdaemon/odysseus, May 2026) and Milla Jovovich's local AI memory system MemPalace (github.com/milla-jovovich/mempalace, Apr 2026)