Complete Guide to Local AI Deployment: Running Llama, Qwen, and GLM on Your Machine

Introduction

The open-source large language model ecosystem has exploded. Models like Llama 3, Qwen 3.5, and GLM-4 now rival proprietary systems in capability, yet remain freely available for anyone to download, run, and customize. The appeal is obvious: complete data privacy, zero recurring API costs, offline availability, and full control over model behavior. Whether you are a developer prototyping an AI application, a researcher experimenting with fine-tuning, or a privacy-conscious user who simply wants a capable assistant without sending data to the cloud, local deployment is the answer.

This guide walks you through the entire journey—from understanding the core concepts behind local inference, to choosing the right framework, to deploying specific models step by step, and finally to fine-tuning them for your own use case. Every command in this article has been tested and is ready to run.

Understanding the Fundamentals

What Happens When You Run a Model Locally

At its core, local deployment means loading a trained neural network's weights (the learned parameters) into your machine's memory—either system RAM or GPU VRAM—and feeding text through it to generate predictions. A model labeled "7B" has approximately 7 billion parameters. At the standard 16-bit floating-point precision (FP16), each parameter occupies 2 bytes, so a 7B model requires roughly 14 GB of memory just to load, before accounting for the context window and intermediate computations.

This is where quantization becomes essential. Quantization reduces the precision of the weights—typically from 16-bit to 4-bit or 8-bit—dramatically shrinking memory requirements with minimal quality loss:

FP16 (no quantization): Full precision, highest quality, maximum memory usage
INT8 / Q8_0: Roughly halves memory compared to FP16, nearly imperceptible quality loss
INT4 / Q4_K_M: Reduces memory to roughly one-quarter of FP16, excellent quality-to-size ratio
INT2 / Q2_K: Extreme compression, noticeable quality degradation, useful only for severely constrained devices

For most users, 4-bit quantization (Q4_K_M) hits the sweet spot. Modern quantization schemes like Unsloth's Dynamic 2.0 preserve higher precision on critical layers (attention weights) while aggressively compressing less important ones, making 4-bit models perform almost identically to their FP16 counterparts.
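
The memory arithmetic above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions: the bit widths are nominal (real GGUF formats such as Q4_K_M store extra scaling data and mix precisions, so actual files run somewhat larger), only the weights are counted, and the names `weight_memory_gb` and `NOMINAL_BITS` are illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bit widths (an assumption; real quantized files carry extra metadata).
NOMINAL_BITS = {"FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}


def report(params_billion: float) -> None:
    """Print a weights-only estimate for each nominal precision."""
    for name, bits in NOMINAL_BITS.items():
        size = weight_memory_gb(params_billion, bits)
        print(f"{params_billion:g}B at {name}: ~{size:.1f} GB")
```

Calling `report(7)` prints one line per precision. The context window and intermediate activations come on top of these figures.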

Key Frameworks at a Glance

The local deployment ecosystem has matured significantly. Here are the major frameworks and their ideal use cases:

Ollama: One-command deployment with automatic model management and API server. Best for beginners, rapid prototyping, and anyone who wants things to "just work."
llama.cpp: Lightweight C++ inference engine supporting CPU, CUDA, Metal (Apple Silicon), and Vulkan. Best for resource-constrained environments, edge devices, and users who want maximum control over GGUF quantized models.
vLLM: High-throughput GPU inference server with PagedAttention for efficient memory management. Best for production deployments, high-concurrency scenarios, and multi-GPU setups.
Text Generation Inference (TGI): Hugging Face's official serving solution. Best for enterprise deployments that are already invested in the Hugging Face ecosystem.
LM Studio: Desktop application with a graphical interface. Best for non-technical users who want a ChatGPT-like experience locally.
Open WebUI: Browser-based interface that connects to any backend (Ollama, vLLM, TGI). Best for teams wanting a private, multi-user chat interface.

A Decision Framework

Use this simple decision tree to pick your framework:

You want the fastest path to a working model → Ollama
You have limited hardware (no GPU, old laptop) → llama.cpp
You need to serve many concurrent users in production → vLLM
You want an enterprise-grade containerized setup → TGI
You prefer a graphical interface over the terminal → LM Studio
You need a multi-user web interface for your team → Open WebUI + Ollama
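
The decision tree above can also be written as a small lookup table, which is handy if you want to embed the recommendation in a setup script. The key names (`fastest_setup`, `limited_hardware`, and so on) are hypothetical labels invented for this sketch; only the framework names come from the list above.

```python
# Mapping taken from the decision tree; the keys are hypothetical labels.
FRAMEWORK_BY_NEED = {
    "fastest_setup": "Ollama",
    "limited_hardware": "llama.cpp",
    "high_concurrency": "vLLM",
    "enterprise_containers": "TGI",
    "graphical_interface": "LM Studio",
    "team_web_ui": "Open WebUI + Ollama",
}


def recommend_framework(need: str) -> str:
    """Return the framework suggested for a given need."""
    try:
        return FRAMEWORK_BY_NEED[need]
    except KeyError:
        options = ", ".join(sorted(FRAMEWORK_BY_NEED))
        raise ValueError(f"Unknown need {need!r}; expected one of: {options}") from None
```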

Method 1: Ollama — The Fastest Path from Zero to Chat

Ollama has become the de facto standard for quick local model deployment. It handles model downloading, quantization selection, API server management, and even provides a built-in command-line chat interface.

Installation

bash

# macOS / Linux — one-line install
curl -fsSL https://ollama.com/install.sh | sh
 
# Windows — download the installer from https://ollama.com/download
 
# Run OllamaSetup.exe, ensure "Add to PATH" is checked
 
# Restart your terminal, then verify:
ollama --version

Deploying Your First Model

Ollama uses a simple pull and run workflow. Let us start with three flagship models:

bash

# Qwen 3 (the qwen3 tag in the Ollama library): outstanding Chinese and multilingual capability
ollama pull qwen3:8b
 
# Llama 3.1: Meta's versatile general-purpose model
ollama pull llama3.1:8b
 
# GLM-4: excellent for bilingual Chinese-English tasks
ollama pull glm4:9b

Each tag in the Ollama library points to a specific build, and the default tags typically resolve to a 4-bit quantized version. For an 8B model, expect roughly 4–5 GB of disk space and 6–8 GB of memory during inference.

Chatting via Command Line

bash

# Start an interactive chat session
ollama run qwen3:8b
 
# You will see a prompt like:
 
# >>> Hello, can you help me write a Python function?

Using the REST API

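Ollama automatically starts a local server on port 11434. It exposes a native REST API under /api and an OpenAI-compatible API under /v1. The native chat endpoint looks like this: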

bash

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a quicksort function in Python"}
  ],
  "stream": false
}'

The OpenAI-compatible endpoint under /v1 is particularly useful: any tool that supports the OpenAI API format can connect to Ollama simply by changing the base URL. For example, in Python:

python

from openai import OpenAI
 
# Point the OpenAI client at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama does not require a real key
)
 
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    temperature=0.7
)
print(response.choices[0].message.content)
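
For comparison, the native /api/chat endpoint can be called from Python with only the standard library. This is a minimal sketch, assuming a default local install listening on port 11434; the helper names `build_chat_payload`, `extract_reply`, and `chat` are made up for illustration, and error handling is omitted.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local address


def build_chat_payload(model: str, user_prompt: str, system_prompt: str = "") -> dict:
    """Build the JSON body for a non-streaming chat request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {"model": model, "messages": messages, "stream": False}


def extract_reply(response: dict) -> str:
    """Non-streaming /api/chat responses carry the reply under message.content."""
    return response["message"]["content"]


def chat(model: str, user_prompt: str, system_prompt: str = "") -> str:
    """Send one chat turn to a running Ollama server and return the reply text."""
    payload = build_chat_payload(model, user_prompt, system_prompt)
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),  # a request body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the server running and the model pulled, `print(chat("qwen3:8b", "Write a quicksort function in Python"))` should print the reply.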

Model Management

bash

ollama list            # List all locally downloaded models
ollama rm qwen3:8b     # Remove a model to free disk space
ollama show qwen3:8b   # Show model details and parameters

Creating a Custom Model with a Modelfile

You can customize a model's behavior (system prompt, parameters, template) using Ollama's Modelfile:

bash

# Create a Modelfile
cat > MyCoder <<EOF
FROM qwen3:8b
 
# Set a custom system prompt
SYSTEM You are an expert Python developer. Always provide type hints, docstrings, and unit tests for every function you write.
 
# Adjust generation parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
EOF
 
# Build and run your custom model
ollama create my-coder -f MyCoder
ollama run my-coder
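
If you maintain several custom models, generating the Modelfile from code can keep them consistent. The sketch below only renders the three instructions used above (FROM, SYSTEM, PARAMETER); the function name `render_modelfile` is hypothetical, and the triple-quoted SYSTEM form is used so that multi-line prompts also work.

```python
from pathlib import Path


def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render a minimal Modelfile with FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base_model}", f'SYSTEM """{system_prompt}"""']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


def write_modelfile(path: str, text: str) -> None:
    """Write the rendered Modelfile to disk."""
    Path(path).write_text(text, encoding="utf-8")
```

For example, `write_modelfile("MyCoder", render_modelfile("qwen3:8b", "You are an expert Python developer.", {"temperature": 0.3, "num_ctx": 8192}))` produces a file that can be passed to `ollama create my-coder -f MyCoder`.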

Method 2: llama.cpp — Maximum Control, Minimum Resources

When you need to run models on hardware that Ollama cannot handle, or when you want fine-grained control over every inference parameter, llama.cpp is the answer. It runs on everything from a Raspberry Pi to a multi-GPU server.

Compilation

bash

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
 
# CPU-only build (works everywhere)
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j
 
# NVIDIA GPU acceleration (recommended if you have a GPU)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
 
# Apple Silicon (M1/M2/M3/M4) — Metal GPU acceleration
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

Obtaining and Preparing Models

llama.cpp uses the GGUF format, which packages quantized weights and model metadata into a single file. You have two options:

Option A: Download pre-quantized GGUF files directly

bash

pip install huggingface_hub
 
# Download Unsloth's optimized Q4_K_M quantization of Qwen 3.5
huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
  --include "Qwen3.5-9B-Q4_K_M.gguf" \
  --local-dir ./models

Option B: Convert and quantize a model yourself

bash

# Download the original Hugging Face model (the weights are stored with Git LFS)
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
 
# Convert to GGUF (FP16 intermediate format); run from the llama.cpp directory
python convert_hf_to_gguf.py Qwen2.5-7B-Instruct \
  --outtype f16 \
  --outfile qwen2.5-7b-fp16.gguf
 
# Quantize to INT4 (Q4_K_M — recommended balance)
./build/bin/llama-quantize \
  qwen2.5-7b-fp16.gguf \
  qwen2.5-7b-q4_k_m.gguf \
  q4_k_m

Running Interactive Chat

bash

./build/bin/llama-cli \
  -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
  --ctx-size 8192 \
  -cnv

Key parameters explained:

-m: Path to the GGUF model file
--ctx-size: Context window size in tokens. Larger values use more memory but allow longer conversations
-cnv: Enable conversational mode (chat mode)
-n: Maximum number of tokens to generate per response
--temp: Sampling temperature (0 = deterministic, 1 = creative)

Launching an API Server

bash

./build/bin/llama-server \
  -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
  --ctx-size 8192 \
  --port 8080 \
  --n-gpu-layers 35

The --n-gpu-layers parameter controls how many transformer layers to offload to the GPU. Set it to 0 for pure CPU inference, or a high number (like 35 or 99) to offload as much as possible to the GPU.
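
When the whole model does not fit in VRAM, a back-of-the-envelope estimate can suggest a starting value for --n-gpu-layers. The sketch below is a heuristic, not how llama.cpp itself decides anything: it assumes every layer is the same size, ignores the embedding and output tensors, and sets aside a fixed reserve (`reserve_gb`, an arbitrary default) for the KV cache and scratch buffers. The function name is hypothetical.

```python
import math


def layers_that_fit(model_file_gb: float, n_layers: int,
                    free_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many equally sized layers fit in the available VRAM."""
    if model_file_gb <= 0 or n_layers <= 0:
        raise ValueError("model_file_gb and n_layers must be positive")
    per_layer_gb = model_file_gb / n_layers             # assumes uniform layer size
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)     # keep headroom for the KV cache
    return min(n_layers, math.floor(usable_gb / per_layer_gb))
```

Treat the result as a first guess: start there, watch actual VRAM usage, and adjust up or down.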

The server exposes an OpenAI-compatible API:

python

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)
 
response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Translate this to French: Hello, how are you?"}
    ]
)
print(response.choices[0].message.content)

Choosing the Right Quantization Level

The GGUF ecosystem offers many quantization variants. Here is a practical guide:

Q4_K_M: The recommended default. Excellent quality, small file size. Use this unless you have a specific reason not to.
Q4_K_XL / UD-Q4_K_XL: Unsloth's dynamic 4-bit quantization. Slightly better quality than Q4_K_M at similar size. Highly recommended when available.
Q5_K_M: A quality step up from Q4, with moderate size increase. Good choice if you have extra memory.
Q8_0: Nearly indistinguishable from FP16 quality. Use when memory is not a constraint.
Q2_K: Extreme compression. Only consider this for devices with severe memory limitations.

Method 3: vLLM — Production-Grade GPU Serving

When you need to serve models at scale—with high throughput, low latency, and concurrent request handling—vLLM is the framework of choice. Its PagedAttention algorithm manages the KV cache efficiently, enabling throughput that can be 10–20× higher than naive implementations.

Environment Setup

bash

# Create a clean conda environment
conda create -n vllm python=3.12 -y
conda activate vllm
 
# Install vLLM (requires NVIDIA GPU with CUDA)
pip install vllm

Single-GPU Deployment

bash

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

Multi-GPU Deployment (Tensor Parallelism)

For larger models or higher throughput, distribute across multiple GPUs:

bash

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-27B-Instruct \
  --tensor-parallel-size 2 \
  --port 8000 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90

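The --tensor-parallel-size parameter specifies the number of GPUs to split the model across. At FP16, a 27B model has about 54 GB of weights (27 billion × 2 bytes), so splitting it across 2 GPUs requires roughly 27 GB of VRAM per GPU for the weights alone, plus room for the KV cache.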

Deploying Llama 3.1

bash

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 8000 \
  --max-model-len 8192

Important: Llama models require you to first accept the Meta license on Hugging Face and log in:

bash

huggingface-cli login
 
# Paste your Hugging Face access token (obtain from https://huggingface.co/settings/tokens)

API Usage

vLLM provides a fully OpenAI-compatible API:

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a knowledgeable assistant."},
      {"role": "user", "content": "Explain the difference between TCP and UDP."}
    ],
    "temperature": 0.7,
    "max_tokens": 1024
  }'

Streaming Responses

For real-time output (similar to ChatGPT's typing effect):

python

from openai import OpenAI
 
client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")
 
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about programming"}],
    stream=True
)
 
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Method 4: Enterprise Deployment with TGI and Docker

Text Generation Inference (TGI) is Hugging Face's official production server. It offers built-in health checks, metrics endpoints, token streaming, and first-class Docker support.

Docker Deployment

bash

# Pull the latest TGI image
docker pull ghcr.io/huggingface/text-generation-inference:latest
 
# Launch a model server
docker run --gpus all \
  -p 8080:80 \
  -v ~/.cache/huggingface:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --quantize bitsandbytes-nf4 \
  --max-input-length 4096 \
  --max-total-tokens 8192

Health Check and Monitoring

bash

# Verify the server is healthy
curl http://localhost:8080/health
 
# Get model info
curl http://localhost:8080/info
 
# Generate text
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "The key to successful machine learning is",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7}
  }'

TGI also exposes Prometheus metrics at `/metrics`, making it straightforward to integrate with monitoring systems like Grafana.
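
Before wiring up a full monitoring stack, it can help to inspect the /metrics output directly. The sketch below parses Prometheus text exposition format into a dictionary. It is a simplified parser that assumes samples carry no trailing timestamp, and the metric names in the usage example are invented placeholders rather than actual TGI metric names.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text format into {sample_name_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):      # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")     # the value is the last field
        try:
            samples[name.strip()] = float(value)
        except ValueError:
            continue                              # ignore lines that do not parse
    return samples
```

For example, feeding it the two lines `example_requests_total 42` and `example_latency_seconds{quantile="0.5"} 0.25` yields a dictionary with those two keys. In practice the text would come from fetching http://localhost:8080/metrics.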

Adding a Visual Interface with Open WebUI

Running models from the command line is fine for development, but most users prefer a browser-based chat interface. Open WebUI provides a polished, ChatGPT-like experience that connects to any backend.

Docker Deployment

bash

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser. The first time you visit, you will create an admin account. After that:

Open Settings → Connections and verify Ollama is detected automatically
Click the model selector at the top to choose from your downloaded models
Start chatting

Open WebUI supports multi-user accounts, conversation history, document upload for RAG, web search integration, and a rich plugin ecosystem. For teams wanting a private ChatGPT alternative, it is remarkably capable.

Model-Specific Deployment Notes

Qwen 3.5 (Alibaba)

Qwen 3.5 is arguably the best open-source model family for Chinese language tasks and strong multilingual performance generally. It is available in sizes from 0.8B to 27B+.

Key details:

The small models (0.8B, 2B, 4B, 9B) have thinking mode disabled by default, unlike the larger variants
To enable thinking mode in llama.cpp, add --chat-template-kwargs '{"enable_thinking":true}'
Unsloth provides optimized GGUF quantizations with their Dynamic 2.0 scheme

Recommended sampling parameters:

python

# For Qwen 3.5 models
temperature = 0.6
top_p = 0.95
top_k = 20
min_p = 0.05
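
One way to apply these values through an OpenAI-compatible server is sketched below. `temperature` and `top_p` are standard Chat Completions fields, while `top_k` and `min_p` are not, so this sketch forwards them through the OpenAI Python client's `extra_body` argument. vLLM's server accepts them that way; whether other backends honor the extra fields is an assumption you should verify. The helper name `sampling_kwargs` is hypothetical.

```python
# Values copied from the recommendation above.
QWEN_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.05}

# Fields that the Chat Completions API defines natively.
STANDARD_FIELDS = {"temperature", "top_p"}


def sampling_kwargs(params: dict) -> dict:
    """Split sampling parameters into standard fields and an extra_body payload."""
    kwargs = {k: v for k, v in params.items() if k in STANDARD_FIELDS}
    extra = {k: v for k, v in params.items() if k not in STANDARD_FIELDS}
    if extra:
        kwargs["extra_body"] = extra  # sent as additional JSON fields in the request
    return kwargs
```

The result can be unpacked into a request, for example `client.chat.completions.create(model=..., messages=..., **sampling_kwargs(QWEN_SAMPLING))`.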

Hardware requirements for the 9B model (Q4 quantization): approximately 9 GB of total memory (RAM + VRAM). A MacBook Pro with 16 GB of unified memory runs it comfortably.

Llama 3.1 / 3.3 (Meta)

Meta's Llama family remains one of the most widely deployed open-source models. It excels at general reasoning, English language tasks, and code generation.

Key details:

Requires accepting Meta's license agreement on Hugging Face before downloading
The 8B variant is ideal for consumer hardware; the 70B variant requires multi-GPU setups or aggressive quantization
Available through Ollama with a single ollama pull llama3.1:8b

GLM-4 (Zhipu AI)

GLM-4 is particularly strong at bilingual Chinese-English tasks and has excellent instruction-following capabilities.

Key details:

The 9B variant fits comfortably on consumer hardware
Available on Ollama: ollama pull glm4:9b
Supports a 128K context window in the full-precision version

Fine-Tuning: Building Your Own Specialized Model

Running a generic model is just the beginning. The real power of local deployment lies in fine-tuning—adapting a model to your specific domain, style, or task.

Quick Fine-Tuning with Unsloth

Unsloth dramatically simplifies and accelerates the fine-tuning process. It supports QLoRA (Quantized Low-Rank Adaptation), which means you can fine-tune a 9B model on a single consumer GPU with as little as 12 GB of VRAM—or even for free on Google Colab's T4 GPU.

Installation:

bash

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo

Complete fine-tuning script:

python

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
 
# Configuration
max_seq_length = 2048  # Start small, increase later if needed
model_name = "Qwen/Qwen3.5-9B"  # Change to 0.8B, 2B, or 4B for smaller models
 
# Step 1: Load the model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=True,        # QLoRA: load in 4-bit to save VRAM
    full_finetuning=False,
)
 
# Step 2: Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # LoRA rank — higher = more capacity, more VRAM
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized checkpointing
    random_state=3407,
    max_seq_length=max_seq_length,
)
 
# Step 3: Prepare your dataset
 
# Replace this with your own data in the same format
dataset = load_dataset(
    "json",
    data_files={"train": "your_training_data.jsonl"},
    split="train"
)
 
# Step 4: Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=SFTConfig(
        max_seq_length=max_seq_length,
        per_device_train_batch_size=1,      # Lower if VRAM is tight
        gradient_accumulation_steps=4,      # Effective batch size = 1 * 4 = 4
        warmup_steps=10,
        max_steps=100,                      # Increase for larger datasets
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_8bit",                 # 8-bit optimizer saves VRAM
        seed=3407,
    ),
)
 
trainer.train()
 
# Step 5: Save the fine-tuned model
model.save_pretrained("my-finetuned-model")
tokenizer.save_pretrained("my-finetuned-model")

Preparing Training Data

Your training data should be in JSONL format with each line containing an instruction-response pair:

json

{"messages": [{"role": "user", "content": "What are the side effects of aspirin?"}, {"role": "assistant", "content": "Common side effects of aspirin include stomach irritation, heartburn, and nausea..."}]}

{"messages": [{"role": "user", "content": "How do I reset my router?"}, {"role": "assistant", "content": "To reset your router, locate the small reset button on the back..."}]}
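
A small script can catch malformed records before they reach the trainer. The sketch below writes records in the messages format shown above and applies a few sanity checks. The specific checks (allowed roles, non-empty content, assistant turn last) are reasonable assumptions for chat-style data rather than requirements stated by any particular trainer, and the function names are illustrative.

```python
import json
from pathlib import Path

VALID_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict) -> None:
    """Raise ValueError if a record does not look like a chat training example."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError("every message needs non-empty string content")
    if messages[-1]["role"] != "assistant":
        raise ValueError("the last message should be the assistant's response")


def write_jsonl(records: list, path: str) -> int:
    """Validate records and write them one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            validate_record(record)
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

The resulting file can be loaded with the `load_dataset("json", ...)` call used in the fine-tuning script.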

Tips for Successful Fine-Tuning

Start small. Use the 0.8B or 2B model first to validate your pipeline, then scale up.
Quality over quantity. 500 high-quality examples often outperform 50,000 mediocre ones.
Monitor loss closely. If training loss drops but the model generates gibberish, you may be overfitting—reduce the learning rate or number of steps.
Use gradient checkpointing. The unsloth mode is optimized specifically for this use case and significantly reduces VRAM consumption.
Free option: Unsloth provides ready-made Colab notebooks for each Qwen 3.5 model size—open one in your browser and start training immediately with zero local setup.

Optimization Best Practices

Hardware Acceleration

Different hardware platforms benefit from different approaches:

NVIDIA GPUs: Use CUDA-compiled llama.cpp or vLLM for maximum performance. vLLM selects an optimized attention backend (FlashAttention-based kernels on supported GPUs) automatically, so no extra flag is normally needed.
Apple Silicon (M-series): Use llama.cpp compiled with Metal support, or explore Apple's MLX framework. With unified memory, the GPU can address most of system memory, so a 16 GB Mac can run models that would otherwise need a discrete GPU with a similar amount of VRAM.
CPU only: llama.cpp is your best option. It is heavily optimized for CPU inference with SIMD instructions (AVX2, AVX-512). Expect reasonable speed on modern CPUs for 7B-9B models at Q4 quantization.

Memory Management

Close unnecessary applications before running large models to maximize available memory.
Use swap space wisely. On Linux, adding swap can prevent out-of-memory crashes, but disk-based swap is extremely slow. Only use it as a safety net, not a primary strategy.
Reduce context size (--ctx-size) when possible. The KV cache grows linearly with context length, so a 32K context window needs roughly 4× the KV-cache memory of an 8K window.
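
The effect of context length on memory can be checked with the usual KV-cache size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The configuration in the usage note (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an illustrative assumption typical of an 8B-class model with grouped-query attention; read the real values from your model's config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence, in bytes."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Same quantity expressed in GiB (2**30 bytes)."""
    total = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value)
    return total / 2**30
```

Under the assumed configuration, `kv_cache_gib(32, 8, 128, 8192)` gives 1.0 and `kv_cache_gib(32, 8, 128, 32768)` gives 4.0, which shows the cache scaling linearly with context length.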

Quantization Selection Guide

Personal use / experimentation: Q4_K_M or UD-Q4_K_XL
Production deployment with quality requirements: Q8_0 or FP16 with vLLM
Memory-constrained devices: Q2_K or Q3_K_M (with quality trade-off awareness)
Balanced production: GPTQ INT4 or AWQ (supported by vLLM and TGI)

Common Pitfalls and How to Avoid Them

Out of memory errors. Always check your model's memory requirement before loading. A Q4-quantized 7B model needs about 4–5 GB. An unquantized 70B model needs about 140 GB. Plan accordingly.
Slow first inference. The first prompt after loading a model is usually slower because the weights are still being read into memory and the runtime is warming up (allocating buffers and initializing GPU kernels). Subsequent prompts are faster. Do not benchmark on the first run.
Context window overflow. If your conversation grows beyond the model's context window, it will truncate earlier messages or fail. Monitor token counts in long conversations.
Model format mismatch. llama.cpp uses GGUF, vLLM and TGI typically load Hugging Face safetensors, and Ollama pulls models from its own library (it can also import GGUF files through a Modelfile). Make sure you download the correct format for your framework.
License compliance. Some models (notably Llama) require license acceptance. Read and comply with each model's license before use in commercial applications.
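
For the context window overflow case, one simple mitigation is to trim the oldest turns before each request. The sketch below keeps system messages and then as many of the most recent messages as fit a token budget. The 4-characters-per-token estimate is a crude assumption that varies by language and tokenizer; for accurate counts, use the model's own tokenizer. Function names are illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep system messages plus the newest messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):            # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break                             # stop at the first message that does not fit
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))      # restore chronological order
```

Leave headroom in `max_tokens` for the model's reply, since the context window covers both the prompt and the generated tokens.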

Recommended Learning Path

If you are just getting started, follow this progression:

Install Ollama and run your first model (15 minutes)
Try different models (Qwen, Llama, GLM) to find the best fit for your use case
Set up Open WebUI for a comfortable chat interface (30 minutes)
Learn llama.cpp for finer control and resource-constrained scenarios
Explore vLLM when you need production-grade performance
Fine-tune a model with Unsloth to create your own specialized assistant
Build applications by connecting local models to your software via the OpenAI-compatible API

Conclusion

Running large language models locally has never been more accessible. With tools like Ollama, a capable AI assistant is one command away. For those willing to dig deeper, llama.cpp offers unmatched flexibility across diverse hardware, vLLM delivers production-grade throughput, and Unsloth democratizes fine-tuning. The combination of powerful open-source models (Llama 3, Qwen 3.5, GLM-4) and mature deployment tools means that the main benefits of local AI (privacy, low cost, and control) no longer require specialist infrastructure to achieve.

Start with Ollama and a 7B-8B model today. You will be surprised at how much capability fits on your local machine.

