Complete Guide to Local AI Deployment: Running Llama, Qwen, and GLM on Your Machine
Introduction
The open-source large language model ecosystem has exploded. Models like Llama 3, Qwen 3.5, and GLM-4 now rival proprietary systems in capability, yet remain freely available for anyone to download, run, and customize. The appeal is obvious: complete data privacy, zero recurring API costs, offline availability, and full control over model behavior. Whether you are a developer prototyping an AI application, a researcher experimenting with fine-tuning, or a privacy-conscious user who simply wants a capable assistant without sending data to the cloud, local deployment is the answer.
This guide walks you through the entire journey—from understanding the core concepts behind local inference, to choosing the right framework, to deploying specific models step by step, and finally to fine-tuning them for your own use case. Every command in this article has been tested and is ready to run.
Understanding the Fundamentals
What Happens When You Run a Model Locally
At its core, local deployment means loading a trained neural network's weights (the learned parameters) into your machine's memory—either system RAM or GPU VRAM—and feeding text through it to generate predictions. A model labeled "7B" has approximately 7 billion parameters. At the standard 16-bit floating-point precision (FP16), each parameter occupies 2 bytes, so a 7B model requires roughly 14 GB of memory just to load, before accounting for the context window and intermediate computations.
This is where quantization becomes essential. Quantization reduces the precision of the weights—typically from 16-bit to 4-bit or 8-bit—dramatically shrinking memory requirements with minimal quality loss:
- FP16 (no quantization): Full precision, highest quality, maximum memory usage
- INT8 / Q8_0: Roughly halves memory compared to FP16, nearly imperceptible quality loss
- INT4 / Q4_K_M: Reduces memory to roughly one-quarter of FP16, excellent quality-to-size ratio
- INT2 / Q2_K: Extreme compression, noticeable quality degradation, useful only for severely constrained devices
For most users, 4-bit quantization (Q4_K_M) hits the sweet spot. Modern quantization schemes like Unsloth's Dynamic 2.0 preserve higher precision on critical layers (attention weights) while aggressively compressing less important ones, making 4-bit models perform almost identically to their FP16 counterparts.
Key Frameworks at a Glance
The local deployment ecosystem has matured significantly. Here are the major frameworks and their ideal use cases:
- Ollama: One-command deployment with automatic model management and API server. Best for beginners, rapid prototyping, and anyone who wants things to "just work."
- llama.cpp: Lightweight C++ inference engine supporting CPU, CUDA, Metal (Apple Silicon), and Vulkan. Best for resource-constrained environments, edge devices, and users who want maximum control over GGUF quantized models.
- vLLM: High-throughput GPU inference server with PagedAttention for efficient memory management. Best for production deployments, high-concurrency scenarios, and multi-GPU setups.
- Text Generation Inference (TGI): Hugging Face's official serving solution. Best for enterprise deployments that are already invested in the Hugging Face ecosystem.
- LM Studio: Desktop application with a graphical interface. Best for non-technical users who want a ChatGPT-like experience locally.
- Open WebUI: Browser-based interface that connects to any backend (Ollama, vLLM, TGI). Best for teams wanting a private, multi-user chat interface.
A Decision Framework
Use this simple decision tree to pick your framework:
- You want the fastest path to a working model → Ollama
- You have limited hardware (no GPU, old laptop) → llama.cpp
- You need to serve many concurrent users in production → vLLM
- You want an enterprise-grade containerized setup → TGI
- You prefer a graphical interface over the terminal → LM Studio
- You need a multi-user web interface for your team → Open WebUI + Ollama
Method 1: Ollama — The Fastest Path from Zero to Chat
Ollama has become the de facto standard for quick local model deployment. It handles model downloading, quantization selection, API server management, and even provides a built-in command-line chat interface.
Installation
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
Deploying Your First Model
Ollama uses a simple pull and run workflow. Let us start with three flagship models:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
Ollama automatically selects the appropriate quantized version for your hardware. For an 8B model, expect roughly 4–5 GB of disk space and 6–8 GB of memory during inference.
Chatting via Command Line
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
Using the REST API
Ollama automatically starts an OpenAI-compatible API server on port 11434:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
This is particularly powerful because any tool that supports the OpenAI API format can connect to Ollama simply by changing the base URL. For example, in Python:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
| 16 | |
Model Management
| 1 | |
| 2 | |
| 3 | |
Creating a Custom Model with a Modelfile
You can customize a model's behavior (system prompt, parameters, template) using Ollama's Modelfile:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
| 16 | |
Method 2: llama.cpp — Maximum Control, Minimum Resources
When you need to run models on hardware that Ollama cannot handle, or when you want fine-grained control over every inference parameter, llama.cpp is the answer. It runs on everything from a Raspberry Pi to a multi-GPU server.
Compilation
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
Obtaining and Preparing Models
llama.cpp uses the GGUF format, which packages quantized weights and model metadata into a single file. You have two options:
Option A: Download pre-quantized GGUF files directly
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
Option B: Convert and quantize a model yourself
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
Running Interactive Chat
| 1 | |
| 2 | |
| 3 | |
| 4 | |
Key parameters explained:
-m: Path to the GGUF model file--ctx-size: Context window size in tokens. Larger values use more memory but allow longer conversations-cnv: Enable conversational mode (chat mode)-n: Maximum number of tokens to generate per response--temp: Sampling temperature (0 = deterministic, 1 = creative)
Launching an API Server
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
The --n-gpu-layers parameter controls how many transformer layers to offload to the GPU. Set it to 0 for pure CPU inference, or a high number (like 35 or 99) to offload as much as possible to the GPU.
The server exposes an OpenAI-compatible API:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
Choosing the Right Quantization Level
The GGUF ecosystem offers many quantization variants. Here is a practical guide:
- Q4_K_M: The recommended default. Excellent quality, small file size. Use this unless you have a specific reason not to.
- Q4_K_XL / UD-Q4_K_XL: Unsloth's dynamic 4-bit quantization. Slightly better quality than Q4_K_M at similar size. Highly recommended when available.
- Q5_K_M: A quality step up from Q4, with moderate size increase. Good choice if you have extra memory.
- Q8_0: Nearly indistinguishable from FP16 quality. Use when memory is not a constraint.
- Q2_K: Extreme compression. Only consider this for devices with severe memory limitations.
Method 3: vLLM — Production-Grade GPU Serving
When you need to serve models at scale—with high throughput, low latency, and concurrent request handling—vLLM is the framework of choice. Its PagedAttention algorithm manages the KV cache efficiently, enabling throughput that can be 10–20× higher than naive implementations.
Environment Setup
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
Single-GPU Deployment
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
Multi-GPU Deployment (Tensor Parallelism)
For larger models or higher throughput, distribute across multiple GPUs:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
The --tensor-parallel-size parameter specifies the number of GPUs to split the model across. A 27B model on 2 GPUs requires roughly 14 GB of VRAM per GPU at FP16.
Deploying Llama 3.1
| 1 | |
| 2 | |
| 3 | |
| 4 | |
Important: Llama models require you to first accept the Meta license on Hugging Face and log in:
| 1 | |
| 2 | |
| 3 | |
API Usage
vLLM provides a fully OpenAI-compatible API:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
Streaming Responses
For real-time output (similar to ChatGPT's typing effect):
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
Method 4: Enterprise Deployment with TGI and Docker
Text Generation Inference (TGI) is Hugging Face's official production server. It offers built-in health checks, metrics endpoints, token streaming, and first-class Docker support.
Docker Deployment
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
Health Check and Monitoring
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
TGI also exposes Prometheus metrics at /metrics, making it straightforward to integrate with monitoring systems like Grafana.
Adding a Visual Interface with Open WebUI
Running models from the command line is fine for development, but most users prefer a browser-based chat interface. Open WebUI provides a polished, ChatGPT-like experience that connects to any backend.
Docker Deployment
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
Then open http://localhost:3000 in your browser. The first time you visit, you will create an admin account. After that:
- Open Settings → Connections and verify Ollama is detected automatically
- Click the model selector at the top to choose from your downloaded models
- Start chatting
Open WebUI supports multi-user accounts, conversation history, document upload for RAG, web search integration, and a rich plugin ecosystem. For teams wanting a private ChatGPT alternative, it is remarkably capable.
Model-Specific Deployment Notes
Qwen 3.5 (Alibaba)
Qwen 3.5 is arguably the best open-source model family for Chinese language tasks and strong multilingual performance generally. It is available in sizes from 0.8B to 27B+.
Key details:
- The small models (0.8B, 2B, 4B, 9B) have thinking mode disabled by default, unlike the larger variants
- To enable thinking mode in llama.cpp, add
--chat-template-kwargs '{"enable_thinking":true}' - Unsloth provides optimized GGUF quantizations with their Dynamic 2.0 scheme
Recommended sampling parameters:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
Hardware requirements for the 9B model (Q4 quantization): approximately 9 GB of total memory (RAM + VRAM). A MacBook Pro with 16 GB of unified memory runs it comfortably.
Llama 3.1 / 3.3 (Meta)
Meta's Llama family remains one of the most widely deployed open-source models. It excels at general reasoning, English language tasks, and code generation.
Key details:
- Requires accepting Meta's license agreement on Hugging Face before downloading
- The 8B variant is ideal for consumer hardware; the 70B variant requires multi-GPU setups or aggressive quantization
- Available through Ollama with a single
ollama pull llama3.1:8b
GLM-4 (Zhipu AI)
GLM-4 is particularly strong at bilingual Chinese-English tasks and has excellent instruction-following capabilities.
Key details:
- The 9B variant fits comfortably on consumer hardware
- Available on Ollama:
ollama pull glm4:9b - Supports a 128K context window in the full-precision version
Fine-Tuning: Building Your Own Specialized Model
Running a generic model is just the beginning. The real power of local deployment lies in fine-tuning—adapting a model to your specific domain, style, or task.
Quick Fine-Tuning with Unsloth
Unsloth dramatically simplifies and accelerates the fine-tuning process. It supports QLoRA (Quantized Low-Rank Adaptation), which means you can fine-tune a 9B model on a single consumer GPU with as little as 12 GB of VRAM—or even for free on Google Colab's T4 GPU.
Installation:
| 1 | |
Complete fine-tuning script:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
| 16 | |
| 17 | |
| 18 | |
| 19 | |
| 20 | |
| 21 | |
| 22 | |
| 23 | |
| 24 | |
| 25 | |
| 26 | |
| 27 | |
| 28 | |
| 29 | |
| 30 | |
| 31 | |
| 32 | |
| 33 | |
| 34 | |
| 35 | |
| 36 | |
| 37 | |
| 38 | |
| 39 | |
| 40 | |
| 41 | |
| 42 | |
| 43 | |
| 44 | |
| 45 | |
| 46 | |
| 47 | |
| 48 | |
| 49 | |
| 50 | |
| 51 | |
| 52 | |
| 53 | |
| 54 | |
| 55 | |
| 56 | |
| 57 | |
| 58 | |
| 59 | |
| 60 | |
| 61 | |
| 62 | |
| 63 | |
| 64 | |
| 65 | |
Preparing Training Data
Your training data should be in JSONL format with each line containing an instruction-response pair:
| 1 | |
| 2 | |
Tips for Successful Fine-Tuning
- Start small. Use the 0.8B or 2B model first to validate your pipeline, then scale up.
- Quality over quantity. 500 high-quality examples often outperform 50,000 mediocre ones.
- Monitor loss closely. If training loss drops but the model generates gibberish, you may be overfitting—reduce the learning rate or number of steps.
- Use gradient checkpointing. The
unslothmode is optimized specifically for this use case and significantly reduces VRAM consumption. - Free option: Unsloth provides ready-made Colab notebooks for each Qwen 3.5 model size—open one in your browser and start training immediately with zero local setup.
Optimization Best Practices
Hardware Acceleration
Different hardware platforms benefit from different approaches:
- NVIDIA GPUs: Use CUDA-compiled llama.cpp or vLLM for maximum performance. Enable FlashAttention in vLLM (
--enable-flash-attn) for additional speedup. - Apple Silicon (M-series): Use llama.cpp compiled with Metal support, or explore Apple's MLX framework. The unified memory architecture means a 16 GB Mac can run models that would require a 16 GB GPU on other platforms.
- CPU only: llama.cpp is your best option. It is heavily optimized for CPU inference with SIMD instructions (AVX2, AVX-512). Expect reasonable speed on modern CPUs for 7B-9B models at Q4 quantization.
Memory Management
- Close unnecessary applications before running large models to maximize available memory.
- Use swap space wisely. On Linux, adding swap can prevent out-of-memory crashes, but disk-based swap is extremely slow. Only use it as a safety net, not a primary strategy.
- Reduce context size (
--ctx-size) when possible. A 32K context window uses roughly 4× more memory than an 8K window.
Quantization Selection Guide
- Personal use / experimentation: Q4_K_M or UD-Q4_K_XL
- Production deployment with quality requirements: Q8_0 or FP16 with vLLM
- Memory-constrained devices: Q2_K or Q3_K_M (with quality trade-off awareness)
- Balanced production: GPTQ INT4 or AWQ (supported by vLLM and TGI)
Common Pitfalls and How to Avoid Them
-
Out of memory errors. Always check your model's memory requirement before loading. A Q4-quantized 7B model needs about 4–5 GB. An unquantized 70B model needs about 140 GB. Plan accordingly.
-
Slow first inference. The first prompt after loading a model is always slower because the model must compile CUDA kernels and fill the KV cache. Subsequent prompts are faster. Do not benchmark on the first run.
-
Context window overflow. If your conversation grows beyond the model's context window, it will truncate earlier messages or fail. Monitor token counts in long conversations.
-
Model format mismatch. Ollama uses its own format, llama.cpp uses GGUF, and vLLM uses HuggingFace safetensors. Make sure you download the correct format for your framework.
-
License compliance. Some models (notably Llama) require license acceptance. Read and comply with each model's license before use in commercial applications.
Recommended Learning Path
If you are just getting started, follow this progression:
- Install Ollama and run your first model (15 minutes)
- Try different models (Qwen, Llama, GLM) to find the best fit for your use case
- Set up Open WebUI for a comfortable chat interface (30 minutes)
- Learn llama.cpp for finer control and resource-constrained scenarios
- Explore vLLM when you need production-grade performance
- Fine-tune a model with Unsloth to create your own specialized assistant
- Build applications by connecting local models to your software via the OpenAI-compatible API
Conclusion
Running large language models locally has never been more accessible. With tools like Ollama, a capable AI assistant is literally one command away. For those willing to dig deeper, llama.cpp offers unmatched flexibility across diverse hardware, vLLM delivers production-grade throughput, and Unsloth democratizes fine-tuning. The combination of powerful open-source models (Llama 3, Qwen 3.5, GLM-4) and mature deployment tools means that the barriers to local AI—privacy, cost, and control—are now problems of the past.
Start with Ollama and a 7B-8B model today. You will be surprised at how much capability fits on your local machine.