All guides

Best GPU for Qwen3 27B: RTX 5090 vs R9700 vs Arc Pro B70 vs MI50

Qwen 3.6 27B is the model that made a lot of people start pricing out a local box. It is a dense 27B that scores 77.2% on SWE-bench Verified, which is higher than Qwen's own 397B mixture-of-experts model from the previous generation, and it ships under Apache 2.0 with a 256K context window. In plain terms: a coding model you can actually fit on one card now matches things that used to need a server.

So the question in everyone's search bar is simple. Which card should run it?

Short answer: they all fit, so pick on software and speed

Qwen 3.6 27B at Q4 is about 16 GB of weights. Add room for context and you want roughly 20 GB to start and 24 GB to be comfortable. That means every card below runs it, including a used 24 GB RTX 3090. The decision is not "will it fit." The decision is how mature the software stack is, how fast it decodes, and what you pay to get there.

Here is the lineup, with build cost meaning the card plus a roughly $700 host PC. All hardware figures come from our own dataset, last price-checked on June 3, 2026.

Build (single card) Build cost VRAM (usable) Memory bandwidth Software stack Maturity vLLM
RTX 5090 ~$4,500 32 GB (31) 1,792 GB/s CUDA (Blackwell) 5 / 5 Production
RTX 3090 (used) ~$1,750 24 GB (23) 936 GB/s CUDA 5 / 5 Production
AMD R9700 ~$2,050 32 GB (31) 640 GB/s ROCm (RDNA 4) 3 / 5 Community
Intel Arc Pro B70 ~$1,800 32 GB (31) 608 GB/s SYCL / Vulkan 2 / 5 Experimental
AMD MI50 32 GB (used) ~$700 32 GB (31) 1,024 GB/s ROCm (gfx906) / Vulkan 2 / 5 None

Two things jump out. The cheapest new 32 GB cards (R9700 and B70) actually have less memory bandwidth than a five-year-old 3090, and the cheapest card on the list, a used MI50, has more bandwidth than either of them. Bandwidth is what sets the ceiling on how fast a dense model like the 27B spits out tokens, so keep that in mind as you read the speed section.

How fast is Qwen 3.6 27B on each card?

A dense 27B is bandwidth-bound when it generates text, because every one of its 27 billion parameters has to be read from memory for each token. Prompt processing (chewing through a long prompt before it answers) is the opposite, it is compute-bound, and that is where NVIDIA's Blackwell parts pull far ahead. Our data has the RTX 5090 ingesting prompts several times faster than the AMD and Intel cards.

For single-stream decode speed on the dense 27B, the most useful numbers come from people running it at home, so treat these as community-reported and config-dependent rather than lab-controlled:

If decode speed on the cheaper cards matters more to you than peak quality, there is a shortcut worth knowing. Qwen shipped a sibling model, Qwen 3.6 35B-A3B, a mixture-of-experts that only activates 3 billion parameters per token. It fits the same 32 GB at Q4 and feels noticeably snappier on bandwidth-limited cards, because the card only has to read 3B of weights per token instead of 27B. If you want the 27B's coding quality, run the dense model. If you want speed on a budget card, the A3B MoE is the play. Our model picker shows both side by side with per-card numbers.

One more thing worth saying out loud, because it saves money: spending $2,700 to $4,700 on a big unified-memory box like a Strix Halo or DGX Spark does not unlock a meaningfully better coding model than this 27B on a 24 GB 3090. The dense 27B beats every open-weight MoE that fits in 128 GB or less. Capacity buys you the ability to run frontier MoE models that need 600 GB or more, not a better experience at the 27B tier.

Rated bandwidth vs what the software actually uses

Memory bandwidth is the ceiling, but you only reach it if the inference kernels are good enough to keep the memory bus busy every cycle. This is where the maturity column quietly turns into real tokens per second, and it is the single biggest reason two cards with similar bandwidth can decode at very different speeds.

On the CUDA cards the kernels are mature enough to run the bus close to flat out. A well-tuned llama.cpp or vLLM decode on an RTX 5090 or 3090 sustains something near the card's rated bandwidth, so the 27B's decode speed lands roughly where the 1,792 GB/s and 936 GB/s figures predict. Years of CUDA-first kernel work is exactly what you are paying for.

The younger stacks leave a big chunk on the table. The Intel Arc Pro B70's SYCL and Vulkan paths are improving fast but are still immature, and in practice they saturate only around half of its 608 GB/s on dense decode today. That is the real reason its tokens per second trail a 3090 by more than the raw bandwidth gap suggests: the 3090 turns most of its 936 GB/s into tokens, while the B70 turns roughly half of its 608 GB/s into tokens. The AMD R9700's ROCm path sits in between, and the used MI50 is the most under-saturated of all relative to its 1 TB/s HBM2, which is exactly why its on-paper bandwidth advantage so rarely shows up in real runs.

Card Rated bandwidth Bandwidth the software reaches today Why
RTX 5090 1,792 GB/s Near full Mature CUDA / Blackwell kernels
RTX 3090 936 GB/s Near full Mature CUDA kernels
AMD R9700 640 GB/s Partial ROCm 7 improving, not yet CUDA-level
Intel Arc Pro B70 608 GB/s ~Half Young SYCL / Vulkan backend
AMD MI50 32 GB 1,024 GB/s Low fraction Community Vulkan build, no vLLM path

These utilisation figures are rough and config-dependent — they move with driver version, quantization, batch size and context length. Treat them as the shape of the gap, not a benchmark. The point holds regardless: on the immature stacks, raw bandwidth overstates real decode speed.

The cards, one by one

RTX 5090: fastest, if you can stomach the price

The 5090 is the fastest single GPU under $5,000 and the only card here that stays fast as your context grows past 100K tokens. Its 32 GB fits the 27B with huge context headroom, the CUDA 13 stack is rock solid, and vLLM plus TensorRT-LLM both supported it on day one. The catch is price. The $1,999 MSRP is effectively a paper number; real cards run $3,500 to $3,999 on Amazon and Newegg, with premium SKUs past $4,500, so a full build lands around $4,500. Buy it if prompt speed and long-context decode are worth the premium. See the full RTX 5090 build page.

Buy on Amazon ↗

RTX 3090: the value baseline that quietly wins

The used 3090 is the build our data recommends for anyone running a 27B-class dense model at or under $1,500 of card cost. 24 GB of CUDA Ampere fits Qwen 3.6 27B at Q4 with 32K context, and because it is CUDA, every tool works on day one with no workarounds: vLLM, TensorRT-LLM, llama.cpp, ComfyUI. Used cards run $800 to $1,130 (about $1,050 average in late May), pushed up by a memory-chip price spike and 5090 prices doubling. If you want zero software friction for the least money, this is the answer, and you can drop in a second 3090 for 48 GB later.

Buy on Amazon (used) ↗

AMD R9700: the cheapest new 32 GB card

The Radeon AI Pro R9700 is the cheapest brand-new 32 GB workstation card by a wide margin, at a $1,299 launch price (partner boards like the ASRock Creator and PowerColor run roughly $1,350 to $1,520 as memory contract pricing pushes them up). It is RDNA 4, runs both ROCm 7 and Vulkan, and is Windows-friendly, which the data-center cards on this list are not. The trade-offs are real: 640 GB/s bandwidth means dense decode is slower, prompt processing is about three times slower than the 5090, and vLLM support is still community-grade rather than production. Good pick if you want a new, warrantied AMD card and value capacity over raw speed. R9700 build page.

Buy on Amazon ↗

Intel Arc Pro B70: cheapest 32 GB on the market, if you tolerate setup

At a $949 launch price (street cards now around $1,099, and it has been Newegg's number-one workstation GPU), the Arc Pro B70 is the cheapest way to get 32 GB of discrete VRAM, period. It is quiet, has ECC memory, and the software is improving fast: Intel's OpenVINO 2026.1 added a first-class llama.cpp backend, and the Vulkan path already runs Qwen 3.6 27B today. But the SYCL and IPEX-LLM stack is still two to three years behind CUDA, which is the honest reason NVIDIA keeps winning here, and why it reaches only about half its rated bandwidth on dense decode. Expect to spend more time on setup. Buy it for the best VRAM-per-dollar if you enjoy tinkering. Arc Pro B70 build page.

Buy on Amazon ↗

AMD MI50 32 GB: the homelab bargain with a software tax

The used MI50 is the best dollar-per-gigabyte-per-bandwidth card on Earth right now: about $150 to $250 for a 32 GB card with 1 TB/s HBM2, plus a blower kit and a host. That bandwidth is why homelabbers love it. The tax is software. AMD dropped the gfx906 silicon from ROCm 7, so the community keeps it alive with a llama.cpp Vulkan build and a maintained ROCm 6.4.4 fork, and there is no real vLLM path. Independent benchmarks land well below the most optimistic community claims, so set expectations and verify the 32 GB VBIOS before you buy, because some sellers reflash 16 GB cards. Buy it if the build is the hobby. MI50 build page.

Buy on Amazon ↗

Which should you pick?

Every one of these runs Qwen 3.6 27B. The right answer depends on your budget, how much you care about speed versus capacity, and how much software friction you will accept.

The fastest way to settle it for your exact situation is to run the numbers yourself. Our build picker ranks every card here for the 27B at your budget, and the compare view puts any two of them head to head on speed, VRAM and software maturity.

Pick your Qwen 3.6 27B build at llmrequirements.com.

Sources