LM Studio: a desktop application for running large language models on your own machine

A reference companion to the LM Studio project — a cross-platform GUI that bundles a model browser, chat interface, and an OpenAI-compatible local server, so anyone can run open-weight LLMs offline on Windows, macOS, or Linux.

Windows 10 & 11 · macOS 13+ · Linux AppImage · Apple Silicon native · CUDA / ROCm acceleration
What LM Studio does

A complete local LLM workstation in one application

Nine capabilities that explain why LM Studio has become a default desktop choice for running open-weight models offline, building local agents, and prototyping with private data.

Run open-weight models offline

Load Llama, Mistral, Qwen, Phi, Gemma, and other GGUF models directly from disk. Inference happens entirely on the local machine — LM Studio never sends prompts or responses to a remote inference service.

Read more on local LLM

One install per platform

Native installers for Windows, Apple Silicon and Intel Macs, and an AppImage for Linux. Each build of LM Studio uses GPU acceleration when available and falls back to CPU when it isn't.

Browse platform pages

OpenAI-compatible local server

Toggle the server on from LM Studio's server tab and a local REST endpoint appears at http://localhost:1234. Any tool that speaks the OpenAI Chat Completions schema can connect to it with a one-line base URL change.

Server mode reference

Built-in model library

Browse a curated catalog of quantized GGUF models with hardware-fit hints. The library inside LM Studio surfaces context length, parameter count, and approximate RAM footprint before you download.

Open model library

Privacy-first by default

No prompt data, no embeddings, and no model files leave the device unless you explicitly export them. LM Studio is a strong fit for legal teams, healthcare research, and any workflow that needs offline guarantees.

Read security policy

GPU acceleration that just works

CUDA on NVIDIA, Metal on Apple Silicon, ROCm on supported AMD parts, and Vulkan as a portable fallback. LM Studio detects the right backend automatically and exposes a layer slider for hybrid CPU/GPU offload.

Tuning notes

Quantization-aware loading

Q4_K_M, Q5_K_M, Q6_K, Q8_0 and full FP16 weights all load through the same dialog. LM Studio reports the expected quality / speed trade-off before the model finishes downloading.

Quantization guide

Documentation that scales with you

Beginner pages walk through the very first prompt; deeper material covers prompt templates, chat presets, server mode, and headless scripting. Every feature in LM Studio has a corresponding written reference.

Open documentation

Active community ecosystem

Plugins, model presets, prompt packs, and integration recipes are shared openly. The LM Studio community Discord and the public issue tracker turn user feedback into release notes.

GitHub presence

Why a desktop app for local LLMs — and why LM Studio specifically

Local inference used to be a terminal-only sport. LM Studio replaced that with a visual workstation, which is why it crosses over to product managers, data analysts, and writers as easily as engineers.

Two years ago, running a 7B-parameter model on a laptop meant cloning a C++ inference repo, fighting CMake flags, and hand-converting weights between half a dozen tensor formats. The barrier to entry was not the hardware; it was the toolchain. LM Studio collapsed that barrier into a single double-click. Pick a model in the in-app library, watch a progress bar, then start chatting. Underneath the friendly surface, LM Studio still uses the same GGUF inference primitives that power the open-source ecosystem — just wrapped in a UI that handles the rough edges.

The case for local inference is not anti-cloud; it is about choice. There are categories of work where sending text to a hosted endpoint is impossible by policy: medical notes, attorney–client material, internal HR documents, source code under NDA. There are also use cases where round-trip latency or per-token cost makes local inference simply faster or cheaper. LM Studio sits in the middle of that spectrum — comfortable for a hobbyist tinkering on a Saturday, and serious enough that consultants pull it out at client sites where Wi-Fi is unreliable or restricted.

The architecture inside LM Studio is deliberately boring. There is one process that hosts a model in memory, one process that draws the UI, and one optional process that exposes a local HTTP API on a port you choose. Models are kept as ordinary files in a folder you can open in Finder or File Explorer. Prompts are not phoned home. There is no telemetry that captures the contents of conversations. That predictability is part of the appeal.

The shape of a typical session

A first-time user usually opens LM Studio, lands on the discover tab, and types a model name like llama-3-8b-instruct or mistral-7b-instruct. The library shows quantized variants ranked by file size with hardware-fit hints next to each one. After download, a one-click Eject button frees memory; loading another model is as fast as the SSD can read it.

From there, three workflows dominate. The first is interactive chat, where a session window holds a system prompt, a sampling preset, and a scrollable transcript. The second is server mode, used by anyone connecting LM Studio to a code editor, a custom agent, or a notebook. The third is presets — reusable bundles of system prompt, temperature, top-p, and stop tokens that can be swapped between models without losing a session.

Where this site fits

This site is an independent reference. It mirrors the structure of the LM Studio product so visitors can find the page that matches what they are doing right now: installing on a specific OS, comparing it to a competitor, troubleshooting a model that won't load, or wiring the server up to an external client. Internal links keep related topics close together; external links go out to public standards bodies or research sources rather than vendors that want to sell you something.

Picking hardware that matches your model goals

There is no single “best” rig for local inference — but there are clear thresholds where adding RAM, swapping a GPU, or stepping up to Apple Silicon unlocks a noticeably larger class of models inside the application.

The most common question new users ask is some variant of “will my laptop run a 7B model?” In almost every case the answer is yes, provided the laptop has 16 GB of unified memory or 16 GB of system RAM with a recent CPU. A Q4 quantization of a 7B model occupies roughly 4–5 GB on disk and a similar footprint in working memory, which leaves headroom for the operating system and a browser. Step up to 13B and the comfortable RAM target moves to 24 GB. For 30B–34B at usable speeds, 32 GB is the realistic floor, and 70B class models genuinely benefit from 64 GB or more, especially if you want long context windows.
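For a quick sanity check before downloading, the footprint can be approximated from parameter count and bits per weight. The sketch below is a rule of thumb under assumed constants: the effective bits-per-weight values for each quantization level and the flat overhead factor for runtime buffers are illustrative guesses, not figures reported by LM Studio.

```python
# Back-of-the-envelope sizing for quantized GGUF models.
# The bits-per-weight values and the 1.2x overhead factor (KV cache,
# runtime buffers) are illustrative assumptions, not LM Studio figures.

BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def estimated_ram_gb(params_billions: float, quant: str, overhead: float = 1.2) -> float:
    """Approximate working-set size in GB for a given model size and quantization."""
    weight_bytes = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes * overhead / 1e9

for size in (7, 13, 34, 70):
    print(f"{size}B @ Q4_K_M ~ {estimated_ram_gb(size, 'Q4_K_M'):.1f} GB")
```

Plugging in the sizes from the paragraph above reproduces its thresholds: about 5 GB for a Q4 7B model, roughly 9 GB for 13B, around 25 GB for the 34B class, and about 50 GB for 70B, which is why 64 GB machines are where that class becomes comfortable.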

GPU acceleration changes the picture. On Apple Silicon, the unified memory architecture means the same chip handles both layers and the operating system — an M-series machine with 32 GB of unified memory will happily host a 13B model with comfortable headroom, and a 64 GB machine starts to make 30B feasible at acceptable token rates. On NVIDIA hardware, dedicated VRAM is the ceiling: an 8 GB card can fully host a quantized 7B model, a 12 GB card opens up Q4 13B, 16 GB makes Q5 13B comfortable, and 24 GB cards begin to make 30B class models a reasonable everyday workload. AMD parts with mature ROCm support sit roughly alongside the equivalent NVIDIA tier; Vulkan-only fallbacks lag behind for now but still beat CPU-only by a wide margin.

Inside the application, the layer-offload slider lets you split a model between GPU and CPU. If a model is just a little too big for VRAM, dialing the slider down by a few layers will keep most of the workload on the GPU while spilling the remainder to system RAM — the result is slower than a fully resident model, but still meaningfully faster than CPU-only inference. That hybrid mode is one of the reasons LM Studio feels forgiving on real-world hardware where models and capacity rarely match perfectly.
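To make the hybrid split concrete, the following sketch shows the arithmetic behind choosing an offload count: given a model's size and layer count, estimate how many layers fit into a VRAM budget and leave the rest on the CPU. The even-layer-size assumption, the 1 GB reserve, and the example numbers are illustrative; LM Studio's slider does the equivalent bookkeeping interactively.

```python
def gpu_layers_that_fit(model_size_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, assuming roughly equal layer
    sizes and reserving headroom for the KV cache and scratch buffers."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Example: a ~7.9 GB Q4 13B model with 40 layers on an 8 GB card
# lands around 35 GPU layers, with the remainder spilled to system RAM.
print(gpu_layers_that_fit(7.9, 40, 8.0))
```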

Workflows that justify a desktop runtime

Three concrete patterns explain who picks LM Studio over an API provider: offline iteration on sensitive prompts, building local-first apps against a stable endpoint, and prepping models for production deployment elsewhere.

The first pattern is offline prompt iteration. A consultant on a client site, a researcher on a flight, or a hospital analyst behind a strict firewall all share the same problem: the cloud is not on the menu. With a model already cached locally, LM Studio handles dozens of iterations on system prompts and few-shot examples without touching the network. When the work is done, the conversation log can be exported as JSON or markdown for handoff — the prompts that worked travel back to the office, while the underlying data stays where it belongs.

The second pattern is local-first application development. Anyone building a personal agent, a tool for a small team, or a desktop integration that bundles AI features benefits from a stable endpoint that does not bill per token. Pointing an agent framework at http://localhost:1234/v1/chat/completions turns the developer’s machine into both the runtime and the test bed. When a user later flips a setting to point at a hosted endpoint instead, nothing else in the code changes — LM Studio mirrors the OpenAI schema closely enough that the swap is one configuration line.
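As a concrete illustration of that one-line swap, the sketch below points the official openai Python client at the local endpoint from the paragraph above. The model identifier is a placeholder for whatever is loaded in LM Studio, and the API key is a dummy value to satisfy the client library; the local server typically ignores it.

```python
from openai import OpenAI

# Same client code works against a hosted endpoint; only base_url changes.
# The model name below is a placeholder -- use whatever model is loaded
# in LM Studio. The api_key is a dummy value to satisfy the client.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3-8b-instruct",          # placeholder identifier
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize why local inference matters."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Switching the same code to a hosted provider is a matter of changing base_url (and supplying a real key); the request and response shapes stay the same.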

The third pattern is pre-production model evaluation. Before committing a model to a production GPU server, teams will frequently load several quantization levels in LM Studio side-by-side, run a fixed prompt suite, and judge the trade-off between speed, memory footprint, and qualitative output. The desktop UI makes it trivial to swap presets and rerun — that turn-around is what makes the application a viable evaluation rig, not just a chat toy.
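A minimal version of that evaluation loop can be scripted against the same local endpoint: run a fixed prompt suite against each quantization and record latency alongside the output. The model identifiers below are hypothetical, and the sketch assumes each variant is available through LM Studio's server when its turn comes.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# Hypothetical identifiers for the same model at two quantization levels.
VARIANTS = ["mistral-7b-instruct-q4_k_m", "mistral-7b-instruct-q6_k"]
PROMPTS = [
    "Explain retrieval-augmented generation in two sentences.",
    "Rewrite in plain English: 'Leverage the aforementioned modality.'",
]

for model in VARIANTS:
    for prompt in PROMPTS:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        elapsed = time.perf_counter() - start
        answer = resp.choices[0].message.content
        print(f"[{model}] {elapsed:.1f}s  {answer[:80]!r}")
```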

Across all three workflows, the common thread is control. Local inference is no longer a hobby project; it is a legitimate engineering discipline with its own benchmarks, tuning knobs, and deployment patterns. LM Studio sits at the friendly end of that discipline, lowering the toolchain barrier while still exposing the levers that experienced users want.

How to navigate this LM Studio reference

If you are brand new, the LM Studio quickstart page walks through the first ten minutes of the app. If you already know what you want, jump straight to the LM Studio download page or the platform-specific install guides for Windows, Mac, and Linux. Developers who care about the LM Studio API or LM Studio server should head to the capabilities silo. Anyone weighing LM Studio against another runtime can read the LM Studio vs Ollama comparison, the LM Studio alternative roundup, and the LM Studio GitHub presence overview. The LM Studio documentation index at the top of the resources column maps every topic on the site, and the LM Studio tutorial introduces a complete worked example end-to-end.

Practitioner testimonials

"For research that touches patient data, every cloud option was a non-starter. A laptop running LM Studio with a 7B model is enough for the prompt iteration we do, and it never leaves the device."
"The model library inside the app is the difference. Junior engineers don't have to learn the GGUF rabbit hole on day one — they just pick a model with a green hardware badge and start prompting."
"We pre-load three quantized models on every analyst laptop using LM Studio. Offline flights, customer sites, secure rooms — the workflow is identical, no VPN gymnastics needed."

Frequently asked questions

Answers to the seven questions new visitors most often ask before installing or evaluating LM Studio.

Ready to run language models on your own machine?

Pick a platform, grab the latest installer, and have a model loaded inside ten minutes.

Open the download page