RAG in Practice: Building a Knowledge Q&A System That Actually Works
I've been running into the same problem over and over while working on Hermes Agent: how do you make an AI remember everything in your project's docs?
Shove everything into the prompt? Context window isn't big enough. Fine-tune the model? Have to retrain every time docs change. That's expensive and slow.
Turns out RAG is the standard answer to this problem. It's not a new concept, but the tooling in 2026 is way more mature than it was two years ago. Worth revisiting.
What RAG Actually Is
RAG stands for Retrieval-Augmented Generation. Sounds complicated, but the idea is dead simple:
Search first, then answer.
When you ask the AI a question, it doesn't rely on its own "memory." Instead, it searches your knowledge base first, grabs the relevant bits, and uses those as reference to compose an answer.
Think of it like a doctor who can't memorize every case but knows how to look up the medical records system. RAG gives AI that "look things up" ability.
The benefits are pretty clear:
- No retraining needed — just update the knowledge base
- Answers can cite sources, so they're verifiable
- Private data doesn't need to go into model training
- Knowledge base can be updated anytime — no "knowledge cutoff" problem
The Simplest RAG Pipeline
A working RAG system has five core steps:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
Looks simple enough, right? But the devil is in the details.
The "load documents" step can blow up in your face — how do you handle tables in PDFs? What about text in images? Chinese PDF encoding issues?
The "split" step is even worse. Too large, and retrieval precision drops. Too small, and context breaks — the model can't understand fragments.
I've hit all these pitfalls. Let me walk through them.
Document Chunking: Where RAG Goes Wrong Most Often
People think RAG's core is the vector database or the embedding model. It's not. The quality of your document chunking directly determines RAG's ceiling.
If chunking is bad, everything downstream is just polishing garbage.
Basic Chunking Strategies
The simplest approach is fixed-size chunking — cut every N characters:
| 1 | |
| 2 | |
The problem? It doesn't care about semantics at all. A sentence might get cut in half, making both fragments useless.
Slightly better is paragraph-based splitting:
| 1 | |
| 2 | |
But paragraphs vary wildly in length — some are two sentences, some are two thousand words.
What I Actually Recommend: Recursive Character Splitting
LangChain's RecursiveCharacterTextSplitter tries the big delimiter first (\n\n), then falls back to smaller ones (\n) if the chunk is still too long:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
The chunk_overlap matters a lot. Without it, if a concept spans two chunks, neither chunk has the complete picture. 50-100 characters of overlap is usually enough.
The Chinese Chunking Pitfall
Chinese doesn't have natural word boundaries like English spaces, so chunking needs extra care:
- Use Chinese punctuation as separators. The default
["\n\n", "\n", " ", ""]is designed for English. For Chinese, add"。","!","?"etc. - Don't set chunk_size too small. Chinese packs more information per character. 500 Chinese characters is roughly equivalent to 200 English words — already pretty short.
- Don't hard-cut tables or code blocks. If your docs contain code or tables, identify them first and keep them intact.
I once processed a batch of Chinese technical docs with default English separators. Half the sentences came out as fragments. The retrieved chunks were all half-sentences that the model couldn't use. Switching to Chinese separators fixed most of it.
Choosing an Embedding Model
Embedding converts text into vectors. Distance between vectors represents semantic similarity — closer means more similar.
The main embedding models in 2026:
- OpenAI text-embedding-3-small/large: Good quality, API-based, per-token pricing. The small version is great value.
- BGE series (BAAI): Open source, strong Chinese support, self-hostable. bge-large-zh-v1.5 is a top choice for Chinese content.
- Jina embeddings v3: Multilingual, open source, self-hostable.
- Cohere embed-v4: Good multilingual support, API-based.
For Chinese-heavy data, the BGE series is the way to go. No API costs, and it understands Chinese semantics better than OpenAI's models.
Local BGE deployment:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
One important thing: once you pick an embedding model, don't switch casually. Switching means re-embedding everything you've stored — that's not cheap.
Vector Databases: Storing and Querying
A vector database stores vectors and does similarity search. Lots of options in 2026:
- ChromaDB: Lightweight, Python-native, great for prototyping.
pip installand go. - Milvus/Zilliz: Production-grade, supports massive scale, distributed deployment.
- Qdrant: Written in Rust, great performance, clean API design.
- Pinecone: Fully managed SaaS — no ops overhead, but costs money.
- Weaviate: Supports hybrid search (vector + keyword), feature-rich.
- pgvector: PostgreSQL extension. If you're already on PG, just add an extension — no new component needed.
For personal projects or small teams, ChromaDB or pgvector is enough. Don't spin up a Milvus cluster for a side project.
I use ChromaDB for Hermes Agent — mainly because it's dead simple:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
That's it. No server config, no connection pools. Data lives in a local folder.
Retrieval Strategies: It's Not Just "Find the Nearest"
Basic retrieval is vector similarity search — find the chunks most "similar" to the question. But in practice, that alone isn't enough.
Problem 1: Semantically Similar ≠ Answer-Relevant
A user asks "how to set up Python virtual environments." Vector search might return chunks about "Python" and "environments" in general, but the specific chunk about venv or conda ranks lower.
Fix: Reranking
Use vector search for a coarse top-20, then a reranker model to reorder by actual relevance:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
Rerankers are slower than embedding models, but they look at the full interaction between query and document — much more precise. Coarse-then-fine is the sweet spot between speed and accuracy.
Problem 2: User's Wording Doesn't Match the Docs
User says "how to deploy," docs say "installation and configuration guide." Semantically close, but keyword mismatch.
Fix: Hybrid Search
Combine vector search with keyword search:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
Many vector databases now have hybrid search built in — Qdrant and Weaviate both support it natively. No need to write the fusion logic yourself.
What's New in RAG for 2026
RAG has evolved a lot in the past two years. The basic "chunk-embed-retrieve-generate" pipeline is still the same, but there are new optimization layers on top.
Agentic RAG
This is the hottest direction in 2026. The core idea: let an AI Agent decide how to search, instead of hardcoding the retrieval flow.
Traditional RAG is fixed: receive question → search → generate answer.
Agentic RAG hands the "search" step to the Agent:
- Does this question need to search the knowledge base at all?
- Are the results good enough? If not, try different keywords
- This question needs to be broken into sub-questions, searched separately, then combined
- This question should query a SQL database, not vector search
Implementation with LangChain:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
The Agent decides on its own whether to search, how many times, and with what keywords. Much more flexible than a hardcoded pipeline.
Graph RAG
Microsoft's GraphRAG adds a knowledge graph layer on top of standard RAG.
Regular RAG retrieves "text chunks." Graph RAG retrieves "entities and relationships." If you ask "which project does Zhang San work on?", regular RAG might find a chunk mentioning "Zhang San" that doesn't include project info. Graph RAG can traverse the knowledge graph directly: "Zhang San → owns → Project X."
Works well for relationship-heavy scenarios like org charts, supply chains, and legal document cross-references.
The downside: building a knowledge graph is a pain. Not worth it for small projects.
Contextual Retrieval
Anthropic proposed a neat optimization: add context to each chunk during splitting.
Example: original text says "The product supports SSO." After chunking, this fragment loses the "which product?" context. Contextual Retrieval has the model prepend: "This is about Product X's features: The product supports SSO."
Chunks with context retrieve much more accurately. Anthropic's tests showed this simple technique reduced retrieval failure rates by 35%.
RAG vs Fine-tuning vs Long Context
In 2026, there are three paths to "make the model understand your data":
RAG: Don't modify the model, retrieve reference content in real-time. Good for data that changes often, need for source citations, and large data volumes.
Fine-tuning: Retrain parts of the model with your data. Good for changing model behavior (output style, specialized terminology). But expensive — data changes mean retraining.
Long context: Stuff all your data into the prompt. Models now support huge context windows — Claude handles 200K tokens, Gemini handles 1M. If your data isn't huge, just shoving it in might be simpler than building RAG.
How to choose:
- Small data (under ~50K characters) → Long context, simplest
- Large data, frequently updated → RAG
- Need to change model behavior/style → Fine-tuning
- Need source citations, explainability → RAG
- Budget-conscious, don't want to maintain infra → Long context or API calls
Honestly, a lot of people build RAG because it sounds "advanced." But if your data volume is small, long context is probably simpler and more effective. Don't use tech for tech's sake.
RAG Evaluation: How to Know If It's Working
After building RAG, how do you know if it's any good? "I tried a few questions and it seemed fine" isn't enough.
Core Metrics
RAG quality has two dimensions:
Retrieval quality: Are the retrieved chunks relevant to the question? Metrics: Recall@K (how many of top-K results are actually relevant) and MRR (where does the first relevant result rank).
Generation quality: Is the model's answer good given the retrieved context? Metrics: accuracy (is the answer correct), completeness (does it miss key info), and faithfulness (does it make up stuff not in the retrieved context).
That last one — faithfulness — is critical. A common RAG problem is hallucination: the model confidently generates information that wasn't in the retrieved results. You need to specifically check for this.
A Simple Evaluation Method
If you don't want a complex evaluation framework, try this:
- Prepare 20-30 test questions covering your knowledge base
- Manually annotate "correct answer" and "documents that should be retrieved"
- Run your RAG pipeline, compare results
- Calculate retrieval recall and answer accuracy
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
For something more formal, check out RAGAS — a dedicated RAG evaluation framework that supports Faithfulness, Answer Relevancy, Context Recall, and other metrics.
Honestly, most people build RAG and never evaluate it. Then users complain, and they discover retrieval recall is only 40% — most questions can't find the right document. Spending half a day on evaluation saves a lot of headaches later.
Pitfalls I've Hit
Pitfall 1: chunk_size Too Large
I started with chunk_size=2000, thinking bigger chunks mean more complete context. Turns out each retrieved chunk was huge, and top_k=5 filled up the context window. The model couldn't find the important bits in all that noise.
Switched to 500 with reranking. Much better. Go small on chunks, use reranking for relevance, rather than big chunks full of noise.
Pitfall 2: No Metadata Filtering
My knowledge base had docs, issue records, and config files mixed together. When users asked "how to configure X," vector search returned both documentation and issue discussions — contradicting each other.
Added metadata filtering:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
Pitfall 3: Embedding Model / Query Language Mismatch
My docs were in Chinese, but users sometimes asked in English. Using a Chinese-only embedding model for English queries gave terrible results.
Fix: use a multilingual embedding model like BGE-M3 or Jina embeddings v3 — handles both Chinese and English well.
Pitfall 4: Forgetting About Document Updates
I only built indexing, not updating. Docs changed but the vector DB still had old versions. Answers referenced outdated info.
Added a simple hash-based update mechanism:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
Hands-On: Build a Minimal RAG in 20 Minutes
Enough theory. Here's a minimal working RAG example with ChromaDB + OpenAI:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | |
| 8 | |
| 9 | |
| 10 | |
| 11 | |
| 12 | |
| 13 | |
| 14 | |
| 15 | |
| 16 | |
| 17 | |
| 18 | |
| 19 | |
| 20 | |
| 21 | |
| 22 | |
| 23 | |
| 24 | |
| 25 | |
| 26 | |
| 27 | |
| 28 | |
| 29 | |
| 30 | |
| 31 | |
| 32 | |
| 33 | |
| 34 | |
| 35 | |
| 36 | |
| 37 | |
That's all the code. It runs, answers questions, and can cite sources.
Production needs more: error handling, logging, update mechanisms, reranking, hybrid search. But this minimal version lets you understand the core RAG flow in 20 minutes.
Tool Recommendations
If you want to build a RAG app quickly without starting from scratch:
- LangChain: Most popular RAG framework. Rich ecosystem, but APIs change frequently and docs sometimes lag behind code.
- LlamaIndex: Focused specifically on RAG. More focused than LangChain, with a more elegant design for data indexing and querying.
- Dify: Open-source LLM platform. Drag-and-drop RAG workflow building, no coding required.
- RAGFlow: Open-source, strong document parsing (tables, image OCR). Great for Chinese document handling.
- FastGPT: User-friendly UI, good for quickly building knowledge base Q&A systems.
For adding doc Q&A to your own project, LlamaIndex is enough. For building a product for users, Dify or RAGFlow are better fits.
Final Thoughts
RAG isn't a silver bullet. It solves "let the model access your data," but it introduces new complexity: chunking strategy, embedding model selection, vector DB operations, retrieval quality tuning.
The good news in 2026: the tooling is mature. ChromaDB, LlamaIndex, Dify — these tools lower the barrier significantly.
But the core problems — how to chunk docs, how to ensure retrieval quality, how to handle data updates — you still have to tune for your specific use case. No one-size-fits-all config exists.
I'm planning to try Agentic RAG next — letting the Agent decide retrieval strategy instead of hardcoding it. I'll write that up when I get there.
Got questions? Drop them in the comments.