CUDA vs Vulkan for llama.cpp: Which Backend Should You Actually Run?
As of June 2026.
When you build llama.cpp, you pick a backend: the code path it uses to talk to your GPU. The two that come up most are CUDA (NVIDIA's own toolkit) and Vulkan (an open cross-vendor graphics and compute API). The choice changes how fast your model runs, which GPUs work at all, and how much setup you sign up for.
Here is the short version, and then the numbers behind it.
Short answer
- NVIDIA card: use CUDA. It is faster at both prompt processing and token generation, and tooling tends to support it first.
- AMD card: it depends. ROCm is fastest at chewing through prompts; Vulkan is often faster at generating tokens, runs on Windows, and needs no toolkit install. For most people on a Radeon, Vulkan is the path of least resistance.
- Intel Arc: use SYCL (Intel's path) if you want speed; Vulkan is the easy fallback that runs everywhere but is slower.
- Apple Silicon: neither. Use the Metal backend, which is the native path on a Mac.
- Old, mixed, or unsupported GPUs: Vulkan. It is the one backend that runs on almost anything with a driver.
Vulkan is not "the slow option" anymore. In 2026 it is the universal option that has quietly closed most of the token-generation gap. CUDA is still the fastest option on the hardware it owns.
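The short answer above can be written down as a small lookup. This is only a sketch of the rules in this article, not anything llama.cpp ships: the function name `pick_backend`, its parameters, and the string labels are hypothetical, and it ignores corner cases such as an NVIDIA card too old for current CUDA.

```python
def pick_backend(vendor: str, os_name: str = "linux",
                 priority: str = "ease", rocm_supported: bool = True) -> str:
    """Suggest a llama.cpp backend from the article's rules of thumb.

    vendor:          "nvidia", "amd", "intel", "apple", or anything else
    os_name:         "linux", "windows", or "macos"
    priority:        "ease" (least setup) or "speed" (on AMD: prefill speed)
    rocm_supported:  whether the Radeon is on AMD's supported list (assumed input)
    """
    vendor = vendor.lower()
    os_name = os_name.lower()

    if vendor == "nvidia":
        return "cuda"
    if vendor == "apple":
        return "metal"
    if vendor == "amd":
        # ROCm only pays off on Linux, on a supported card, when prefill matters.
        if os_name == "linux" and priority == "speed" and rocm_supported:
            return "rocm"
        return "vulkan"
    if vendor == "intel":
        return "sycl" if priority == "speed" else "vulkan"
    # Old, mixed, or unknown hardware: the universal option.
    return "vulkan"


print(pick_backend("amd", "windows", "speed"))  # Windows Radeon falls back to Vulkan
```

Anything the function does not recognize falls through to Vulkan, which mirrors the last bullet: it is the backend most likely to run at all.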
What these backends actually are
A backend is the layer that turns llama.cpp's math into instructions your GPU can run. Same model, same weights, different code path underneath.
- CUDA is NVIDIA's proprietary stack. It only runs on NVIDIA GPUs, but it is the most mature compute platform in AI, and llama.cpp's CUDA backend gets the most attention.
- Vulkan is an open standard (from the Khronos Group) that runs on NVIDIA, AMD, Intel, Apple (via a translation layer), Android, and a long tail of older cards. One build, almost every GPU.
- ROCm is AMD's CUDA equivalent. Fast on supported Radeon and Instinct cards, but historically Linux-first and pickier about which GPUs it supports.
- SYCL is Intel's path for Arc GPUs, tuned for Intel's matrix hardware.
llama.cpp ships all of these. The question is which one to compile and run for your card. Our hardware dataset tags every build with its stack and a software-maturity score for exactly this reason: a CUDA NVIDIA build scores 5 of 5 for maturity, a ROCm Radeon build sits around 3, and the newer SYCL/Vulkan Intel path is a 2 (measured, from our dataset, last refreshed June 13, 2026).
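Choosing a backend happens at build time, through a CMake switch. The sketch below prints the two commands for a given backend. The switch names are the ones llama.cpp's build documentation used at the time of writing, so treat them as assumptions and check the repository before relying on them. The helper `build_commands` and the `BACKEND_FLAGS` table are hypothetical, and ROCm and SYCL builds usually need extra options (GPU target, Intel compilers) that are left out here.

```python
# CMake switches as named in llama.cpp's build docs at the time of writing.
# Assumption: verify against the current repository before use.
BACKEND_FLAGS = {
    "cuda": "-DGGML_CUDA=ON",
    "vulkan": "-DGGML_VULKAN=ON",
    "rocm": "-DGGML_HIP=ON",
    "sycl": "-DGGML_SYCL=ON",
    "metal": "-DGGML_METAL=ON",
}


def build_commands(backend: str, build_dir: str = "build") -> list[str]:
    """Return the configure and compile commands for one backend."""
    flag = BACKEND_FLAGS[backend.lower()]
    return [
        f"cmake -B {build_dir} {flag}",
        f"cmake --build {build_dir} --config Release",
    ]


for line in build_commands("vulkan"):
    print(line)
```

The point of the table is that the model file never changes; only the switch does, so trying a second backend is a rebuild, not a re-download.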
The one split that explains everything: prefill vs decode
Two numbers decide whether a backend feels fast, and they behave differently:
- Prompt processing (prefill, often "pp512", meaning a 512-token prompt in llama.cpp's benchmark tool): how fast the model reads your prompt before it answers. This is compute-bound, so it rewards raw math throughput and mature kernels. CUDA and ROCm win here.
- Token generation (decode, often "tg128", meaning 128 generated tokens): how fast it writes the answer, token by token. This is mostly memory-bandwidth-bound, because each new token re-reads the model weights, so a less-optimized backend can keep up as long as it can move data. This is where Vulkan has caught up.
Keep this split in mind. Almost every "is Vulkan good now?" argument online is really two arguments, and the answer is different for each half.
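A back-of-envelope way to see why the split matters: the time to a finished answer is prompt tokens divided by the prefill rate, plus output tokens divided by the decode rate. The rates below are made up for illustration (a hypothetical backend at 1,000 tok/s prefill and 50 tok/s decode), and the function name `response_time` is ours. Real rates also drift with context length, so this is a rough model only.

```python
def response_time(prompt_tokens, output_tokens, pp_rate, tg_rate):
    """Return (prefill seconds, decode seconds) for one request."""
    prefill_s = prompt_tokens / pp_rate   # reading the prompt
    decode_s = output_tokens / tg_rate    # writing the answer
    return prefill_s, decode_s


# Hypothetical backend: 1,000 tok/s prefill, 50 tok/s decode.
workloads = {
    "chat (short prompt, long answer)": (200, 500),
    "document Q&A (long prompt, short answer)": (16000, 200),
}
for name, (prompt, output) in workloads.items():
    prefill_s, decode_s = response_time(prompt, output, 1000, 50)
    print(f"{name}: prefill {prefill_s:.1f}s, decode {decode_s:.1f}s")
```

With these numbers the chat case spends almost all of its time in decode, while the long-document case spends most of it in prefill. Which half a backend is good at therefore matters more or less depending on how you use the model.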
NVIDIA: CUDA wins, clearly
On NVIDIA hardware, CUDA is faster on both halves, and the gap is biggest on prefill.
An independent llama.cpp scoreboard published April 23, 2026 measured an RTX 5090 at roughly 14,073 pp512 and 290 tg128 on CUDA, versus 10,382 pp512 and 264 tg128 on Vulkan (reported). That is CUDA about 36% faster at prompt processing and about 10% faster at token generation on the same card. Flash Attention widens the prefill lead further on CUDA.
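The percentages come straight from the reported figures. A minimal check, using only the four numbers quoted above (the helper name `percent_faster` is ours):

```python
def percent_faster(a, b):
    """How much faster rate a is than rate b, in percent."""
    return (a / b - 1.0) * 100.0


# Reported RTX 5090 figures from the scoreboard cited above (tokens per second).
cuda = {"pp512": 14073, "tg128": 290}
vulkan = {"pp512": 10382, "tg128": 264}

for test in ("pp512", "tg128"):
    gap = percent_faster(cuda[test], vulkan[test])
    print(f"{test}: CUDA is {gap:.1f}% faster than Vulkan")
```

Both results round to the 36% and 10% figures in the text.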
So on a 5090, a 3090, or any GeForce or RTX Pro card, there is no real debate: build the CUDA backend. Our own throughput figures for these cards assume CUDA: for example, the RTX 5090 at about 186 tok/s on an 8B model and 70 tok/s on a dense 30B, and the used RTX 3090 at about 90 tok/s on 8B (measured, from our dataset). You would leave that performance on the table by running Vulkan on NVIDIA hardware. The only reasons to choose Vulkan on an NVIDIA card are corner cases: a GPU too old for current CUDA, or a single binary you need to run across mixed-vendor machines. See the RTX 5090 build page for the full numbers.
AMD: this is where it gets interesting
On a Radeon, the "obvious" answer (use AMD's own ROCm) is not always the right one.
ROCm wins prompt processing. But for token generation, Vulkan frequently matches or beats it, and Vulkan has two big practical advantages: it runs on Windows (ROCm for llama.cpp is effectively Linux-only) and it needs no toolkit install, just a current driver.
Concrete numbers, from a Radeon llama.cpp comparison published October 18, 2025 (reported): with the updated RADV Vulkan driver, prompt processing reached about 624 pp512 versus ROCm's 753 (ROCm ahead on prefill), but token generation came in at 49.4 tg128 on Vulkan versus 47.0 on ROCm (Vulkan ahead on decode). The same RADV update lifted AMD prompt processing about 13% on its own. That decode edge for Vulkan has been reported across context lengths and is an open, tracked behavior in the llama.cpp project on cards like the RX 7900 XTX.
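Those four numbers also say when each backend finishes a whole request first. Vulkan loses a little time on every prompt token and gains a little on every output token, so there is a break-even ratio of prompt length to answer length. The sketch below computes it from the reported rates. It assumes the pp512 and tg128 rates hold at other lengths, which they will not exactly, so read the result as a rough guide. The helper names are ours.

```python
def total_time(prompt_tokens, output_tokens, pp_rate, tg_rate):
    """Seconds for prefill plus decode at constant rates."""
    return prompt_tokens / pp_rate + output_tokens / tg_rate


def break_even_ratio(pp_a, tg_a, pp_b, tg_b):
    """Prompt-to-answer token ratio at which backends A and B tie.

    Assumes A decodes faster (tg_a > tg_b) but prefills slower (pp_a < pp_b).
    Below the returned ratio, A finishes the request first.
    """
    decode_saving = 1 / tg_b - 1 / tg_a   # seconds A saves per output token
    prefill_cost = 1 / pp_a - 1 / pp_b    # seconds A loses per prompt token
    return decode_saving / prefill_cost


# Reported Radeon figures from the comparison above: Vulkan is A, ROCm is B.
ratio = break_even_ratio(pp_a=624, tg_a=49.4, pp_b=753, tg_b=47.0)
print(f"Vulkan wins end to end while the prompt is under {ratio:.1f}x the answer length")
```

On these figures the break-even sits a little under 4 prompt tokens per output token: chat-style use favors Vulkan, while long-document prompts with short answers favor ROCm.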
For the newer 32GB workstation cards that people keep searching for, the picture matches: Phoronix's ROCm 7.1 review put the Radeon AI PRO R9700 close to the W7900 on prefill. Our dataset rates both the R9700 and the older MI50 as ROCm-plus-Vulkan cards, with Vulkan as the everyday fallback. The legacy MI50's gfx906 silicon was dropped from current ROCm, so a community Vulkan build is what keeps it alive.
Rule of thumb for AMD: if you are on Windows or you just want it to work, start with Vulkan. If you are on Linux, want the fastest prefill, and your card is on AMD's supported list, install ROCm.
Intel Arc: SYCL for speed, Vulkan for ease
Intel is the mirror image of the AMD story. Intel's own SYCL backend is meaningfully faster than Vulkan on Arc, because it uses Intel's matrix hardware that the Vulkan driver does not fully tap yet. Community benchmarks on the 32GB Arc Pro B70 show SYCL decoding at roughly 21 tok/s versus about 11 tok/s on Vulkan for a quantized model (reported), close to a 2x gap.
Our dataset tags the B70 as "SYCL / Vulkan" at software-maturity 2 of 5 with a note that the Vulkan and IPEX-LLM paths work today while the SYCL stack is improving fast but not yet production-grade. Translation: Vulkan gets you running on an Arc card with the least fuss, but if you care about speed, the SYCL build is worth the extra setup. See the Arc Pro B70 build page.
Apple Silicon: skip both
On a Mac, the native path is Metal, and it is the one to use. Vulkan technically runs on Apple through a translation layer, but it adds overhead for no benefit. If you are choosing a Mac for local AI, the backend question is already answered for you.
So when should you pick Vulkan on purpose?
Vulkan is the right call more often than its old reputation suggests:
- You have an AMD card and run Windows, or you just do not want to wrestle with ROCm.
- You have an older or unusual GPU that CUDA, ROCm, or SYCL no longer support.
- You run across mixed hardware and want one binary that works on all of it.
- You care most about token generation, not prompt processing, and you are on AMD.
What you give up: top-end speed on NVIDIA and Intel, where the vendor's own backend is faster, some prompt-processing speed on AMD, and a bit of driver robustness. Vulkan is more driver-sensitive than CUDA, and it lacks the deep profiling tooling NVIDIA developers have, a point the maintainers made plainly in a FOSDEM 2026 talk (January 31, 2026) on the backend. The same talk laid out how far it has come: a custom Flash Attention shader, BFloat16 support, operator fusion, and quantized matrix-multiply optimizations that benefit older Intel, AMD, and NVIDIA parts alike. The direction of travel is clear, even if CUDA stays ahead on its own turf.
How this maps to a buying decision
The backend you will run is part of the cost of a card, not an afterthought. An NVIDIA card buys you the most mature software with zero workarounds. An AMD card saves money up front and runs fine on Vulkan, with ROCm available when you want the last bit of prefill speed. An Intel Arc card is the cheapest route to 32GB of VRAM, as long as you accept that the fastest path (SYCL) still asks for patience.
If you are weighing cards, do not guess at this. Our build picker ranks every GPU by real throughput at your budget and model, and the compare view puts any two cards head to head on speed, VRAM, and software maturity, the exact score that captures this CUDA-versus-Vulkan reality. If you already know your model and want the right card for it, start from the model picker.
Pick the card that runs your model best at llmrequirements.com.
Sources
- llama.cpp GPU performance scoreboard: CUDA, ROCm, Vulkan with pp512 / tg128 / Flash Attention (knightli, April 23, 2026): knightli.com
- "Vulkan API for Machine Learning: Competing with CUDA and ROCm in llama.cpp" (FOSDEM 2026 talk, January 31, 2026): fosdem.org and talk notes
- llama.cpp on AMD: RADV Vulkan driver update, prompt processing and token-generation numbers vs ROCm (Hardware Corner, October 18, 2025): hardware-corner.net
- AMD ROCm 7.1 vs RADV Vulkan for llama.cpp on the Radeon AI PRO R9700 (Phoronix): phoronix.com
- Lower token-generation performance on ROCm vs Vulkan, RX 7900 XTX (gfx1100), tracked issue (llama.cpp, ggml-org): github.com/ggml-org
- Intel Arc Pro B70 llama.cpp benchmarks, SYCL vs Vulkan decode (PMZFX, GitHub): github.com/PMZFX
- Performance VULKAN vs CUDA discussion (llama.cpp, ggml-org): github.com/ggml-org
- LLMRequirements hardware dataset: software stack, maturity scores, and throughput per build (internal), and the State of Local AI essay.