Running Large Language Models on a CPU: A Practical Q&A Guide

For a long time, conventional wisdom held that running LLMs locally required a dedicated GPU. But recent advances in model formats like GGUF and aggressive quantization (especially 4-bit variants) have made CPU-only inference not just possible, but practical for many use cases. While a GPU still offers better speeds, modern CPUs can achieve usable performance with the right models and settings. This Q&A covers what you need to know to get started, based on real testing with a typical older laptop.

Why did people think a GPU was necessary for local LLM inference?

The belief stemmed from early LLM frameworks that relied heavily on GPU tensor operations for speed. Models were large (often 7B+ parameters) and required high-precision floating-point math, which GPUs handle efficiently. CPUs, on the other hand, were slower at matrix multiplications and lacked the parallel processing power needed for real-time responses. Many guides and tutorials only covered GPU setups, reinforcing the idea that CPUs weren’t viable. However, newer optimization techniques have shifted this landscape.

What technical changes make CPU inference possible now?

Three key developments come together. First, the GGUF format allows models to be stored in reduced precision, dramatically shrinking memory and bandwidth requirements. Second, quantization reduces the number of bits per weight: 4-bit quantization, for example, trims model size by about 75% compared to full 16-bit, with minimal quality loss. Third, runtimes like llama.cpp are highly optimized for CPU architectures, using techniques such as SIMD vectorization and careful memory management. Together, these changes let even older i5 CPUs run small-to-medium models at speeds that are actually interactive.
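
To see where that 75% figure comes from, the arithmetic is simply parameters times bits per weight divided by 8. The sketch below uses an illustrative 1.5B-parameter model and ignores per-layer overhead and context memory, so treat the numbers as rough:

    # Back-of-the-envelope model size: parameters * bits-per-weight / 8 bytes.
    # The 1.5B parameter count is illustrative, not tied to any specific model.
    PARAMS=1500000000
    for BITS in 16 8 4; do
        echo "${BITS}-bit: $(( PARAMS * BITS / 8 / 1024 / 1024 )) MiB"
    done

The 4-bit figure lands around 700 MiB, roughly a quarter of the 16-bit footprint, which is why a quantized small model fits comfortably in laptop RAM.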

What is tokens per second and why does it matter for usability?

Tokens per second (tok/s) measures how fast the model generates output; each token is roughly a word or part of a word. A rate of 3–5 tok/s feels painfully slow, like watching typewriter text appear one keystroke at a time. Between 15 and 30 tok/s, responses feel snappy enough for conversation or quick lookups, and above 30 tok/s output is near-instant. So while a model may ‘run’ on a CPU at 4 tok/s, that speed makes it impractical. The real goal is finding the combination of model size and quantization that pushes tok/s into the usable range.
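
To make those thresholds concrete, this small shell calculation (plain integer arithmetic, nothing model-specific) converts generation speed into waiting time for a typical 256-token reply:

    # Approximate wait for a 256-token reply at different generation speeds.
    for RATE in 4 15 30; do
        echo "${RATE} tok/s -> about $(( 256 / RATE )) seconds per 256-token reply"
    done

At 4 tok/s you wait over a minute per answer; at 30 tok/s the same reply arrives in well under ten seconds.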

Which model sizes work best on a CPU for everyday use?

In testing, models in the 1B to 2B parameter range deliver the best balance. They are small enough to fit comfortably within 8–12GB of RAM after quantization, and they consistently achieve 15–30+ tok/s on a typical Intel i5 laptop. Despite their smaller size, they handle basic reasoning, summarization, and even light coding tasks surprisingly well. Larger models like 4B+ can be run but often dip to 4–8 tok/s, making them frustratingly slow for real-time use. If you only have a CPU, stick with 1–2B models for a pleasant experience.
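
Before loading a model, it helps to compare its on-disk size (a reasonable proxy for the weight memory it will need) against the RAM currently available. A minimal Linux-only sketch, assuming the modern procps version of free and an illustrative file name:

    # Compare a GGUF's on-disk size with the RAM currently available.
    # The model file name is illustrative; adjust the path to your download.
    MODEL=tinyllama-1.1b.Q4_K_M.gguf
    MODEL_MB=$(du -m "$MODEL" | cut -f1)
    FREE_MB=$(free -m | awk '/^Mem:/ {print $7}')
    echo "Model: ${MODEL_MB} MiB on disk, available RAM: ${FREE_MB} MiB"

If the model size is anywhere near the available figure, expect swapping; pick a smaller model or a lighter quantization.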

What quantization level should I use for CPU inference?

Based on my experiments, Q4_K_M quantization hits the sweet spot. Its weights are roughly half the size of Q8’s, which means fewer cache misses and noticeably faster generation while retaining most of the quality. On my test rig (Intel i5, 12GB RAM), a 1.5B model at Q4_K_M runs at ~28 tok/s, while Q8 drops to ~14 tok/s. For most tasks the output quality difference is negligible. If you’re low on memory or need extra speed, Q4_K_S is even lighter, but the quality degrades more noticeably. Stick with Q4_K_M unless you have a specific reason to go higher or lower.
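
If you already have a higher-precision GGUF, you can produce the Q4_K_M variant yourself instead of downloading it separately. A sketch assuming a built copy of llama.cpp and illustrative file names (the tool is named quantize in older builds and llama-quantize in newer ones):

    # Convert a 16-bit GGUF to Q4_K_M, then compare the resulting file sizes.
    ./quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
    ls -lh model-f16.gguf model-Q4_K_M.gguf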

What hardware was used to test these findings?

All tests were performed on an 8th-generation Intel Core i5 laptop with 12GB of system RAM and an integrated Intel UHD Graphics 620. The iGPU was intentionally left unused because, in practice, iGPU acceleration for LLMs remains experimental and often slower than CPU-only paths. This hardware is typical of the older Linux laptop you might have lying around, so the results apply directly to machines in this class. Newer CPUs with more RAM will naturally perform better, but the key insights about model size and quantization hold regardless.

Can you run these models on a Raspberry Pi or very old hardware?

Yes, but with careful planning. A Raspberry Pi 4 or 5 with 4–8GB RAM can run 1B Q4_K_M models at roughly 3–8 tok/s—slow but usable for non-interactive tasks (batch summarization, simple chat). For older x86 hardware, if you have at least 4GB of free RAM and a CPU supporting SSE4.2, you can run a 500M–1B model at usable speeds. The key is to lower expectations: you won’t get fast real-time conversation, but you can absolutely run local AI without any GPU. Start with the smallest quantized model and test.
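
On very old x86 machines it is worth checking both prerequisites before downloading anything. A quick Linux-only check:

    # Does the CPU advertise SSE4.2, and how much RAM is actually free?
    grep -q sse4_2 /proc/cpuinfo && echo "SSE4.2: supported" || echo "SSE4.2: not found"
    free -h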

How do I start running LLMs on my CPU right now?

  1. Install llama.cpp (compile from source or use a package).
  2. Download a GGUF model from Hugging Face (e.g., TinyLlama 1.1B Q4_K_M).
  3. Run the model with: ./main -m model.gguf -p "Your prompt" -n 256
  4. Monitor RAM usage; keep total usage at or below roughly 80% of system memory to avoid swapping.
  5. Adjust quantization level if needed: try Q4_K_M first, then Q5 or Q8 if you have headroom.
  6. Check tokens per second in the output. If below 10 tok/s, consider a smaller model or heavier quantization. (A complete worked example follows this list.)
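
Here is the whole sequence condensed into one runnable sketch. It assumes a Debian/Ubuntu-style system with git and a C/C++ toolchain already installed, and an illustrative model file name; newer llama.cpp releases build with CMake and name the binary llama-cli rather than main, so adjust accordingly:

    # End-to-end: build llama.cpp, then run a small quantized model on the CPU.
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make                                   # CPU-only build on older releases
    # Download a small Q4_K_M GGUF from Hugging Face (e.g. TinyLlama 1.1B) and
    # place it next to the binary, then:
    ./main -m tinyllama-1.1b.Q4_K_M.gguf -p "Explain GGUF in two sentences." -n 256 -t 4

The -t flag sets the CPU thread count; when generation finishes, llama.cpp prints timing statistics that include the tokens-per-second figure mentioned in step 6.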

With these steps, you can have a functional local LLM running on any CPU, no GPU required.
