Local-first LLM runners compared: Ollama vs. llama.cpp vs. the rest
· 6 MIN READ
How to pick a local LLM runner by the job, not the benchmark: Ollama, llama.cpp, and the alternatives across local hacking, embedding in an app, CPU-only boxes, and GPU servers.
The pick depends on the job — not the throughput benchmark. Hacking on your laptop is a different job from embedding a model in an app, which is different from running inference on a CPU-only box, which is different again from serving real concurrent traffic on a GPU cluster. Each has a clear answer, and the answer is rarely the same tool.
| Job | Reach for | Alternative when… |
|---|---|---|
| Local hacking / run a model on your laptop | Ollama | LM Studio or Jan for a GUI instead of the terminal |
| Embedding in an app / programmatic | llama.cpp server + bindings | llamafile for a single portable binary |
| CPU-only box | llama.cpp + GGUF quantization | Ollama for easier model management on the same box |
| GPU server / production throughput | vLLM | SGLang for latency-sensitive work or same-day new model support |
| Apple Silicon | Ollama (MLX engine) or llama.cpp Metal | MLC for custom compilation targets |
The cast, by layer
Before the jobs: what layer each runner actually occupies, because several of them are the same engine wearing different clothes.
llama.cpp is the engine most of this ecosystem runs on. It's a C++ inference library built around the GGUF format, a quantized model format that makes LLMs fit on consumer hardware. llama.cpp ships a --server flag that starts an OpenAI-compatible HTTP endpoint, plus language bindings (llama-cpp-python, node-llama-cpp). It runs on Mac, Linux, Windows, CPU-only, Metal, and CUDA. "Runs everywhere" means some assembly required.
Ollama wraps llama.cpp — and an MLX engine for Apple Silicon — in a daemon with a model registry, a CLI, and an OpenAI-compatible API at localhost:11434. It trades one axis of control for a dramatically shorter path from zero to a running model. Most people who describe themselves as "using llama.cpp" are running Ollama.
LM Studio and Jan are desktop GUI applications with llama.cpp underneath. For comparing model outputs in a chat interface without touching a terminal, or evaluating models with people who don't live in a terminal. GPT4All belongs in this tier too: another offline-first GUI wrapper.
llamafile is a packaging decision. llama.cpp compiled with Cosmopolitan libc into a single executable that carries its model weights and runs on Linux, Mac, and Windows without any installation step. Ship it to a machine you don't control, run it in CI, hand it to a non-engineer — it works.
vLLM is the outlier — not built on llama.cpp, targets CUDA GPUs, and designed for production throughput rather than local use. Continuous batching and paged attention let it handle many concurrent requests efficiently in ways llama.cpp's server wasn't designed for. The 2026 open-source AI stack reaches for vLLM (or SGLang) when you're serving real traffic.
MLX is Apple's machine learning framework; the MLX-LLM ecosystem runs models natively on Apple Silicon unified memory without going through llama.cpp. MLC (Machine Learning Compilation) takes a compilation approach: it targets Metal, CUDA, and WebGPU from the same model definition, with more setup and more deployment flexibility.
Job 1 — Local hacking: just run the model
Ollama. One command:
ollama run llama3.2:3bIt downloads the GGUF, picks a sensible quantization, and starts serving at localhost:11434 on an OpenAI-compatible API. On Apple Silicon it switches to the MLX engine automatically, which runs noticeably faster than the Metal backend for most models. The CLI also manages multiple pulled models, so switching between them doesn't require re-downloading.
When to pick LM Studio or Jan: you want a side-by-side chat interface, you're evaluating models with a team that doesn't live in a terminal, or you want a visual resource graph during inference. The tradeoff is distance from the knobs — quantization, context length, sampling parameters — that you'd have with Ollama or llama.cpp directly.
Job 2 — Embedding in an app: programmatic and portable
When you're building something that calls a local model, you want a process you own and can iterate against without network round-trips or API key management. Two clear paths.
llama.cpp's server, for when you want a stable HTTP endpoint:
llama-server --model my-model.gguf --port 8080OpenAI-compatible, predictable, fully under your control. Or skip HTTP and use llama-cpp-python as a library, which gives you sampling parameters, logit access, and embedding generation that an HTTP adapter doesn't expose.
llamafile, for when you need to distribute the runner alongside the model:
chmod +x mymodel.llamafile && ./mymodel.llamafileA single file ships the weights and the server together. Useful when you're distributing to environments where you can't install system packages: CI runners, colleague machines, edge nodes.
Ollama's OpenAI-compatible API works fine for programmatic use too and is the lowest-friction starting point. The operational distinction: Ollama is a shared daemon with model lifecycle management, a dependency on a running service. llama.cpp's server or llamafile are processes you start and own per deployment.
Job 3 — CPU-only boxes: no GPU
llama.cpp, and specifically GGUF quantization, is why CPU inference is practical at all. At 4-bit quantization (Q4_K_M), a 7B parameter model fits in roughly 4–5 GB of RAM and runs inference without any GPU. You're trading throughput for hardware accessibility, not model quality — output at Q4_K_M is close to full precision for most tasks.
Ollama is still llama.cpp on a CPU-only box; it works fine and manages models more conveniently. Use whichever interface fits the rest of the system. The point is that GGUF quantization is what unlocks inference on hardware without a GPU, and that's a llama.cpp contribution regardless of which runner is wrapping it.
Job 4 — GPU servers: production throughput
vLLM. The specific reason is continuous batching: rather than waiting for one request to finish before starting the next, it runs multiple requests through the GPU concurrently, packing work into every cycle. Paged attention manages KV cache memory efficiently enough to handle serious concurrent load. llama.cpp's server was designed for a different target and doesn't do this.
The rule of thumb: prototype on Ollama, ship on vLLM or SGLang. Measure tokens per dollar before you scale anything. SGLang has pulled even with vLLM on many workloads and tends to have support for new model architectures faster.
One thing that's clearly done: TGI (Hugging Face's text-generation-inference) was archived in early 2026. If a guide still reaches for TGI, that's a reliable signal it's stale. vLLM and SGLang are where that space now lives.
Apple Silicon aside
Three paths run on M-series Macs: llama.cpp's Metal backend, Ollama's MLX engine (switched to automatically on Apple Silicon), and MLC's compiled Metal kernels. Ollama with MLX is the lowest-friction starting point and performs well for most models. If you're working with custom architectures or need WebGPU and Android targets alongside Metal, MLC's compilation pipeline gives you that flexibility at the cost of more setup.
The throughline
The runner only matters because of the interface it gives you: a CLI, a library, an HTTP endpoint, a portable binary. Match that interface to the deployment target and the pick follows.
Ollama for the CLI and laptop; llama.cpp for the library and full control; llamafile for the portable binary; vLLM for the GPU production server. Those four cover most real cases. The GUI wrappers and the Apple Silicon specialists fit the narrower situations they were built for.
You'll probably use more than one. Ollama on your laptop, llama.cpp or llamafile as the app dependency, vLLM on the server. The table above is a starting point, not a permanent commitment.
The cache
A few things worth keeping — examples, not endorsements:
- The GGUF format is the interoperability layer across the llama.cpp ecosystem. A GGUF file from Hugging Face runs in Ollama, llamafile, LM Studio, Jan, and llama.cpp's server without conversion. That portability is genuine — it's why the local LLM ecosystem held together as the tooling proliferated.
- The Ollama model registry at ollama.com/library is the fastest path to a pulled-and-quantized model. When you need a specific quantization or a model the registry doesn't carry, pull the GGUF directly from Hugging Face and point your runner at the file.
- llama-cpp-python is the Python binding with the most complete access to llama.cpp internals — sampling, logit access, embeddings, token-level control. When Ollama's HTTP interface isn't enough, this is the step down to more control.
The runner that's right for a laptop demo isn't automatically right for a staging server with concurrent users. Local tools are fast to iterate with; GPU servers are designed for real load. In a typical project you'll use both — knowing the distinction before you start saves you from rebuilding it later when you scale.