LM Studio performance: tokens per second and tuning

Inference speed in LM Studio is determined by GPU layer offload, context length, KV cache headroom, batch size, and thread count. This page explains each lever and gives benchmark figures across representative hardware configurations.

Synopsis

The fastest single change you can make in LM Studio is to push as many model layers onto the GPU as VRAM allows. Full GPU residency produces 3–10x more tokens per second than CPU-only inference. After that, reduce context length to free KV cache headroom, match batch size to your usage pattern, and set CPU threads to physical core count only.

GPU layer offload

The GPU Layers slider in the model load dialog determines how many transformer layers run on GPU versus CPU. Every layer added to the GPU accelerates inference; a fully GPU-resident model is always faster than a hybrid one.

Modern transformer models are stacks of identical layers. A 7B-parameter Llama model has 32 layers; a 13B model has 40; a 70B model has 80. Each layer is a bundle of weight matrices and attention operations. When a layer runs on GPU, it benefits from thousands of parallel CUDA, Metal, or ROCm cores that execute matrix multiplications simultaneously. When a layer runs on CPU, the same operations run on far fewer cores with far lower memory bandwidth.

The LM Studio layer-offload slider lets you split a model that is too large for full GPU residency. If a Q4_K_M 13B model needs 9 GB of VRAM and your card has 8 GB, you can offload 35 of the 40 layers to GPU and let the remaining 5 run on CPU. The result is slower than full GPU residency but substantially faster than CPU-only. The speed penalty scales roughly with the fraction of layers on CPU: 5 CPU layers out of 40 is a modest overhead; 20 out of 40 is a significant one.
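
As a rough way to pick a starting value for the slider, you can estimate how many layers fit in free VRAM from the model file size and the layer count. The sketch below is a back-of-envelope Python calculation, not an LM Studio API; the even per-layer split and the 1.5 GB reserve for cache and compute buffers are assumptions to tune for your own card.

```python
# Back-of-envelope estimate of how many transformer layers fit in free VRAM.
# Assumes weights are spread evenly across layers and reserves a fixed margin
# for the KV cache and compute buffers (both figures are assumptions).

def estimate_gpu_layers(model_file_gb: float, total_layers: int,
                        free_vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Return a conservative starting value for the GPU Layers slider."""
    per_layer_gb = model_file_gb / total_layers        # even-split assumption
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)    # keep headroom for the cache
    return min(int(usable_gb / per_layer_gb), total_layers)

# Example from the text: a ~9 GB Q4_K_M 13B model (40 layers) on an 8 GB card.
print(estimate_gpu_layers(model_file_gb=9.0, total_layers=40, free_vram_gb=8.0))
# ~28 layers with a 1.5 GB reserve; shrinking the reserve moves the answer
# toward the 35-of-40 split described above.
```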

On Apple Silicon, the unified memory architecture means GPU and CPU share the same physical RAM. LM Studio's Metal backend on Apple chips is highly optimised, and the layer concept still applies — but you are never penalised for hybrid offloading the way you are on discrete GPU systems, because the data never crosses a PCIe bus. Apple M-series machines with 32 GB or more of unified memory are the most forgiving environment for running large models that exceed a discrete GPU's VRAM ceiling.

Context length and KV cache cost

Context length determines how many tokens the model can see in one pass. Longer context costs VRAM proportionally because the KV cache grows with every token in the window. Shorter context means more room for layers on the GPU.

The KV (key-value) cache stores the attention state for every token in the current context window. Each additional token in the window adds a fixed amount of memory per layer. For a 7B model, a 4096-token context occupies roughly 1–2 GB of additional VRAM on top of the model weights. A 32K context window on the same model needs 8–16 GB for the cache alone — which on an 8 GB GPU would leave no room for the model itself.
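
The cache size can be estimated directly from the model's attention geometry. The sketch below assumes a Llama-2-7B-style layout (32 layers, 32 KV heads, head dimension 128) and an fp16 cache; models that use grouped-query attention or a quantised KV cache need proportionally less.

```python
# KV cache size: two tensors (K and V) per layer, each storing
# n_kv_heads * head_dim values per token, at bytes_per_element each.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_element: int = 2) -> float:
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element
    return total_bytes / 2**30

# Llama-2-7B-style geometry with an fp16 cache.
print(kv_cache_gib(32, 32, 128, 4096))   # ~2.0 GiB at a 4K context
print(kv_cache_gib(32, 32, 128, 32768))  # ~16.0 GiB at a 32K context
```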

The practical advice is to set context length to the shortest value that covers your actual use case. Chat sessions rarely benefit from more than 4096 tokens unless you are injecting long system prompts or document excerpts. Code generation tasks can often fit in 2048 tokens. Only document summarisation, long-form analysis, and multi-turn agentic workflows genuinely need 8K or 16K contexts. Every kilotoken of context you shed buys back VRAM that can be reallocated to additional GPU layers, directly improving token generation speed.

Batch size and concurrent requests

Batch size controls how many prompt tokens are processed in parallel during the prefill phase. Larger batches amortise GPU launch overhead across more work; for server mode with concurrent users, higher batch sizes improve aggregate throughput without increasing latency proportionally.

For interactive single-user chat, the default batch size of 512 is adequate. The prefill phase (processing the prompt) happens once per message, and the generation phase produces one token at a time regardless of batch size. For the LM Studio server under concurrent load — multiple clients sending requests simultaneously — raising the batch size to 1024 or 2048 allows the GPU to process more prompt tokens per pass, improving aggregate throughput. The trade-off is marginally higher peak memory usage during prefill. On GPUs with 16 GB or more of VRAM, a batch size of 2048 is generally safe with 7B and 13B models at typical context lengths.
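
A simple way to observe the effect under concurrent load is to fire several requests at the local server at once and divide the completion tokens by wall-clock time. The sketch below targets LM Studio's OpenAI-compatible endpoint on the default port; the model identifier, prompt, and worker counts are placeholders to adapt.

```python
# Aggregate-throughput probe for the LM Studio server under concurrent load.
# Assumes the server is running on the default port; MODEL is a placeholder
# for whatever model identifier you have loaded.

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:1234/v1/chat/completions"
MODEL = "your-loaded-model"  # placeholder identifier

def one_request(_: int) -> int:
    resp = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user",
                      "content": "Explain GPU layer offload in three sentences."}],
        "max_tokens": 128,
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:   # four concurrent clients
    generated = sum(pool.map(one_request, range(8)))
elapsed = time.perf_counter() - start
print(f"aggregate throughput: {generated / elapsed:.1f} tokens/s")
```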

CPU threads and hybrid inference

CPU thread count matters for CPU-only and hybrid GPU+CPU inference. Set it to the physical core count, not the logical thread count, because hyper-threading does not improve inference throughput and can cause scheduling contention.

Hyper-threading presents two logical processors per physical core. For many workloads, this improves throughput because threads waiting on memory can yield to other threads. Inference is different: the operations are dense matrix multiplications that keep every core busy. Adding hyper-threaded "cores" to an already saturated physical core introduces context-switching overhead without increasing compute capacity. The result is often 5–15% slower inference than using physical cores alone. In LM Studio Settings, find the CPU Threads option and enter the physical core count of your processor — this figure is printed in the CPU specification sheet and differs from the thread count shown in Task Manager or Activity Monitor.
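
If you are unsure of your physical core count, it can also be read programmatically. The sketch below uses the third-party psutil package, which distinguishes physical cores from logical CPUs.

```python
# Report physical vs logical core counts so the CPU Threads setting can be
# matched to physical cores, as recommended above. Requires: pip install psutil

import psutil

physical = psutil.cpu_count(logical=False)  # physical cores
logical = psutil.cpu_count(logical=True)    # logical CPUs (includes hyper-threads)
print(f"physical cores: {physical}, logical CPUs: {logical}")
print(f"suggested CPU Threads setting: {physical}")
```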

In hybrid GPU+CPU mode, the CPU thread setting still matters for the layers that run on CPU. However, if most layers are on the GPU, the CPU layers represent a smaller fraction of total compute and the thread setting has a proportionally smaller impact. A setting of 4–8 physical CPU threads is usually optimal for hybrid mode on most consumer machines.

Hardware benchmark reference

Approximate token generation rates across four hardware configurations give a practical sense of what to expect before buying or upgrading. All figures use Q4_K_M quantization and a 2048-token context window.

Approximate LM Studio tokens-per-second for Q4_K_M models at 2048-token context across representative hardware
Hardware                      | 7B Q4 (t/s) | 13B Q4 (t/s) | 30B Q4 (t/s)
Apple M1 Pro (16 GB)          | ~28–34      | ~14–18       | ~4–6 (CPU overflow)
Apple M3 Max (48 GB)          | ~55–70      | ~38–50       | ~18–26
NVIDIA RTX 3090 (24 GB VRAM)  | ~60–80      | ~40–55       | ~16–22
NVIDIA RTX 4090 (24 GB VRAM)  | ~90–120     | ~60–80       | ~22–32

These figures are community-reported averages and will vary with driver version, OS memory pressure, and the specific model file. They are meant as orientation points, not guarantees. For 30B models on the M1 Pro, full GPU residency is not possible at 16 GB; the figures shown reflect maximum layer offload with CPU overflow. The M3 Max at 48 GB handles 30B Q4 comfortably in GPU-only mode. The RTX 3090 and 4090 both have 24 GB VRAM, which is enough for 30B Q4 with context at 2048 tokens, but a longer context window reduces the layer-resident count and degrades speed.
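
To place your own machine against the table, time a single request against the local server and divide the generated tokens by the elapsed seconds. The sketch below uses the OpenAI-compatible endpoint on the default port; the model identifier is a placeholder, and the timing includes prompt processing, so it slightly understates pure generation speed.

```python
# Single-stream generation-speed check against a local LM Studio server,
# for comparison with the table above. MODEL is a placeholder identifier.

import time
import requests

URL = "http://localhost:1234/v1/chat/completions"
MODEL = "your-loaded-model"  # placeholder identifier

start = time.perf_counter()
resp = requests.post(URL, json={
    "model": MODEL,
    "messages": [{"role": "user",
                  "content": "Write a 200-word overview of model quantization."}],
    "max_tokens": 256,
}, timeout=600)
elapsed = time.perf_counter() - start
resp.raise_for_status()

tokens = resp.json()["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} tokens/s")
```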

Frequently asked questions

Answers to the four questions most often asked about improving inference speed in LM Studio.