Local-first LLM runners compared: Ollama vs. llama.cpp vs. the rest

How to pick a local LLM runner by the job, not the benchmark: Ollama, llama.cpp, and the alternatives across local hacking, embedding in an app, CPU-only boxes, and GPU servers.

The pick depends on the job — not the throughput benchmark. Hacking on your laptop is a different job from embedding a model in an app, which is different from running inference on a CPU-only box, which is different again from serving real concurrent traffic on a GPU cluster. Each has a clear answer, and the answer is rarely the same tool.

Job	Reach for	Alternative when…
Local hacking / run a model on your laptop	Ollama	LM Studio or Jan for a GUI instead of the terminal
Embedding in an app / programmatic	llama.cpp server + bindings	llamafile for a single portable binary
CPU-only box	llama.cpp + GGUF quantization	Ollama for easier model management on the same box
GPU server / production throughput	vLLM	SGLang for latency-sensitive work or same-day new model support
Apple Silicon	Ollama (MLX engine) or llama.cpp Metal	MLC for custom compilation targets

The cast, by layer

Before the jobs: what layer each runner actually occupies, because several of them are the same engine wearing different clothes.

llama.cpp is the engine most of this ecosystem runs on. It's a C++ inference library built around the GGUF format, a quantized model format that makes LLMs fit on consumer hardware. llama.cpp ships a --server flag that starts an OpenAI-compatible HTTP endpoint, plus language bindings (llama-cpp-python, node-llama-cpp). It runs on Mac, Linux, Windows, CPU-only, Metal, and CUDA. "Runs everywhere" means some assembly required.

Ollama wraps llama.cpp — and an MLX engine for Apple Silicon — in a daemon with a model registry, a CLI, and an OpenAI-compatible API at localhost:11434. It trades one axis of control for a dramatically shorter path from zero to a running model. Most people who describe themselves as "using llama.cpp" are running Ollama.

LM Studio and Jan are desktop GUI applications with llama.cpp underneath. For comparing model outputs in a chat interface without touching a terminal, or evaluating models with people who don't live in a terminal. GPT4All belongs in this tier too: another offline-first GUI wrapper.

llamafile is a packaging decision. llama.cpp compiled with Cosmopolitan libc into a single executable that carries its model weights and runs on Linux, Mac, and Windows without any installation step. Ship it to a machine you don't control, run it in CI, hand it to a non-engineer — it works.

vLLM is the outlier — not built on llama.cpp, targets CUDA GPUs, and designed for production throughput rather than local use. Continuous batching and paged attention let it handle many concurrent requests efficiently in ways llama.cpp's server wasn't designed for. The 2026 open-source AI stack reaches for vLLM (or SGLang) when you're serving real traffic.

MLX is Apple's machine learning framework; the MLX-LLM ecosystem runs models natively on Apple Silicon unified memory without going through llama.cpp. MLC (Machine Learning Compilation) takes a compilation approach: it targets Metal, CUDA, and WebGPU from the same model definition, with more setup and more deployment flexibility.

Job 1 — Local hacking: just run the model

Ollama. One command:

ollama run llama3.2:3b

It downloads the GGUF, picks a sensible quantization, and starts serving at localhost:11434 on an OpenAI-compatible API. On Apple Silicon it switches to the MLX engine automatically, which runs noticeably faster than the Metal backend for most models. The CLI also manages multiple pulled models, so switching between them doesn't require re-downloading.

When to pick LM Studio or Jan: you want a side-by-side chat interface, you're evaluating models with a team that doesn't live in a terminal, or you want a visual resource graph during inference. The tradeoff is distance from the knobs — quantization, context length, sampling parameters — that you'd have with Ollama or llama.cpp directly.

Job 2 — Embedding in an app: programmatic and portable

When you're building something that calls a local model, you want a process you own and can iterate against without network round-trips or API key management. Two clear paths.

llama.cpp's server, for when you want a stable HTTP endpoint:

llama-server --model my-model.gguf --port 8080

OpenAI-compatible, predictable, fully under your control. Or skip HTTP and use llama-cpp-python as a library, which gives you sampling parameters, logit access, and embedding generation that an HTTP adapter doesn't expose.

llamafile, for when you need to distribute the runner alongside the model:

chmod +x mymodel.llamafile && ./mymodel.llamafile

A single file ships the weights and the server together. Useful when you're distributing to environments where you can't install system packages: CI runners, colleague machines, edge nodes.

Ollama's OpenAI-compatible API works fine for programmatic use too and is the lowest-friction starting point. The operational distinction: Ollama is a shared daemon with model lifecycle management, a dependency on a running service. llama.cpp's server or llamafile are processes you start and own per deployment.

Job 3 — CPU-only boxes: no GPU

llama.cpp, and specifically GGUF quantization, is why CPU inference is practical at all. At 4-bit quantization (Q4_K_M), a 7B parameter model fits in roughly 4–5 GB of RAM and runs inference without any GPU. You're trading throughput for hardware accessibility, not model quality — output at Q4_K_M is close to full precision for most tasks.

Ollama is still llama.cpp on a CPU-only box; it works fine and manages models more conveniently. Use whichever interface fits the rest of the system. The point is that GGUF quantization is what unlocks inference on hardware without a GPU, and that's a llama.cpp contribution regardless of which runner is wrapping it.

Job 4 — GPU servers: production throughput

vLLM. The specific reason is continuous batching: rather than waiting for one request to finish before starting the next, it runs multiple requests through the GPU concurrently, packing work into every cycle. Paged attention manages KV cache memory efficiently enough to handle serious concurrent load. llama.cpp's server was designed for a different target and doesn't do this.

The rule of thumb: prototype on Ollama, ship on vLLM or SGLang. Measure tokens per dollar before you scale anything. SGLang has pulled even with vLLM on many workloads and tends to have support for new model architectures faster.

One thing that's clearly done: TGI (Hugging Face's text-generation-inference) was archived in early 2026. If a guide still reaches for TGI, that's a reliable signal it's stale. vLLM and SGLang are where that space now lives.

Apple Silicon aside

Three paths run on M-series Macs: llama.cpp's Metal backend, Ollama's MLX engine (switched to automatically on Apple Silicon), and MLC's compiled Metal kernels. Ollama with MLX is the lowest-friction starting point and performs well for most models. If you're working with custom architectures or need WebGPU and Android targets alongside Metal, MLC's compilation pipeline gives you that flexibility at the cost of more setup.

The throughline

The runner only matters because of the interface it gives you: a CLI, a library, an HTTP endpoint, a portable binary. Match that interface to the deployment target and the pick follows.

Ollama for the CLI and laptop; llama.cpp for the library and full control; llamafile for the portable binary; vLLM for the GPU production server. Those four cover most real cases. The GUI wrappers and the Apple Silicon specialists fit the narrower situations they were built for.

You'll probably use more than one. Ollama on your laptop, llama.cpp or llamafile as the app dependency, vLLM on the server. The table above is a starting point, not a permanent commitment.

The cache

A few things worth keeping — examples, not endorsements:

The GGUF format is the interoperability layer across the llama.cpp ecosystem. A GGUF file from Hugging Face runs in Ollama, llamafile, LM Studio, Jan, and llama.cpp's server without conversion. That portability is genuine — it's why the local LLM ecosystem held together as the tooling proliferated.
The Ollama model registry at ollama.com/library is the fastest path to a pulled-and-quantized model. When you need a specific quantization or a model the registry doesn't carry, pull the GGUF directly from Hugging Face and point your runner at the file.
llama-cpp-python is the Python binding with the most complete access to llama.cpp internals — sampling, logit access, embeddings, token-level control. When Ollama's HTTP interface isn't enough, this is the step down to more control.

The runner that's right for a laptop demo isn't automatically right for a staging server with concurrent users. Local tools are fast to iterate with; GPU servers are designed for real load. In a typical project you'll use both — knowing the distinction before you start saves you from rebuilding it later when you scale.