RAG in Practice: Building a Knowledge Q&A System That Actually Works

I've been running into the same problem over and over while working on Hermes Agent: how do you make an AI remember everything in your project's docs?

Shove everything into the prompt? Context window isn't big enough. Fine-tune the model? Have to retrain every time docs change. That's expensive and slow.

Turns out RAG is the standard answer to this problem. It's not a new concept, but the tooling in 2026 is way more mature than it was two years ago. Worth revisiting.

What RAG Actually Is

RAG stands for Retrieval-Augmented Generation. Sounds complicated, but the idea is dead simple:

Search first, then answer.

When you ask the AI a question, it doesn't rely on its own "memory." Instead, it searches your knowledge base first, grabs the relevant bits, and uses those as reference to compose an answer.

Think of it like a doctor who can't memorize every case but knows how to look up the medical records system. RAG gives AI that "look things up" ability.

The benefits are pretty clear:

  • No retraining needed — just update the knowledge base
  • Answers can cite sources, so they're verifiable
  • Private data doesn't need to go into model training
  • Knowledge base can be updated anytime — no "knowledge cutoff" problem

The Simplest RAG Pipeline

A working RAG system has five core steps:

python
# Pseudocode, don't run this
docs = load_documents("./my_docs/")      # 1. Load documents
chunks = split_into_chunks(docs)          # 2. Split into chunks
vectors = embed(chunks)                   # 3. Vectorize
store(vector_db, vectors)                 # 4. Store in vector DB

# At query time
question = "What is RAG?"
relevant = search(vector_db, embed(question), top_k=5)  # 5. Retrieve
answer = llm(f"Answer based on the following:\n{relevant}\n\nQuestion: {question}")

Looks simple enough, right? But the devil is in the details.

The "load documents" step can blow up in your face — how do you handle tables in PDFs? What about text in images? Chinese PDF encoding issues?

The "split" step is even worse. Make the chunks too large, and retrieval precision drops. Make them too small, and context breaks: the model can't make sense of fragments.

I've hit all these pitfalls. Let me walk through them.

Document Chunking: Where RAG Goes Wrong Most Often

People think RAG's core is the vector database or the embedding model. It's not. The quality of your document chunking directly determines RAG's ceiling.

If chunking is bad, everything downstream is just polishing garbage.

Basic Chunking Strategies

The simplest approach is fixed-size chunking — cut every N characters:

python
def fixed_size_split(text, chunk_size=500):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

The problem? It doesn't care about semantics at all. A sentence might get cut in half, making both fragments useless.

Slightly better is paragraph-based splitting:

python
def split_by_paragraph(text):
    return [p.strip() for p in text.split("\n\n") if p.strip()]

But paragraphs vary wildly in length — some are two sentences, some are two thousand words.

What I Actually Recommend: Recursive Character Splitting

LangChain's RecursiveCharacterTextSplitter tries the big delimiter first (\n\n), then falls back to smaller ones (\n) if the chunk is still too long:

python
# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,       # 50-char overlap between adjacent chunks
    # Chinese punctuation for Chinese docs; "" is the last-resort per-character split
    separators=["\n\n", "\n", "。", ",", " ", ""]
)

chunks = splitter.split_text(document)  # document: the full text as one string

The chunk_overlap matters a lot. Without it, if a concept spans two chunks, neither chunk has the complete picture. 50-100 characters of overlap is usually enough.
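To make the overlap idea concrete, here's a minimal pure-Python sketch of a sliding window with overlap. `split_with_overlap` is a name I made up for illustration; it is not how LangChain implements it (LangChain splits on separators first), just the simplest version of the same idea.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

With these toy settings each chunk starts with the last three characters of the one before it, so a phrase that straddles a boundary still appears whole in at least one chunk, as long as it is no longer than the overlap.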

The Chinese Chunking Pitfall

Chinese doesn't have natural word boundaries like English spaces, so chunking needs extra care:

  1. Use Chinese punctuation as separators. The default ["\n\n", "\n", " ", ""] is designed for English. For Chinese, add "。", "!", "?" and similar full-width marks.
  2. Don't set chunk_size too small. Chinese packs more information per character. 500 Chinese characters is roughly equivalent to 200 English words — already pretty short.
  3. Don't hard-cut tables or code blocks. If your docs contain code or tables, identify them first and keep them intact.

I once processed a batch of Chinese technical docs with default English separators. Half the sentences came out as fragments. The retrieved chunks were all half-sentences that the model couldn't use. Switching to Chinese separators fixed most of it.

Choosing an Embedding Model

Embedding converts text into vectors. Distance between vectors represents semantic similarity — closer means more similar.
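"Closer" usually means cosine similarity. Here's a small sketch of the calculation; the three-dimensional vectors are invented for illustration (real models output hundreds or thousands of dimensions), and the function assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three pieces of text
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated
```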

The main embedding models in 2026:

  • OpenAI text-embedding-3-small/large: Good quality, API-based, per-token pricing. The small version is great value.
  • BGE series (BAAI): Open source, strong Chinese support, self-hostable. bge-large-zh-v1.5 is a top choice for Chinese content.
  • Jina embeddings v3: Multilingual, open source, self-hostable.
  • Cohere embed-v4: Good multilingual support, API-based.

For Chinese-heavy data, I'd go with the BGE series: no API costs, and in my experience it handles Chinese semantics better than OpenAI's models.

Local BGE deployment:

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
embeddings = model.encode(["This is a test", "RAG stands for Retrieval-Augmented Generation"])

One important thing: once you pick an embedding model, don't switch casually. Switching means re-embedding everything you've stored — that's not cheap.

Vector Databases: Storing and Querying

A vector database stores vectors and does similarity search. Lots of options in 2026:

  • ChromaDB: Lightweight, Python-native, great for prototyping. pip install and go.
  • Milvus/Zilliz: Production-grade, supports massive scale, distributed deployment.
  • Qdrant: Written in Rust, great performance, clean API design.
  • Pinecone: Fully managed SaaS — no ops overhead, but costs money.
  • Weaviate: Supports hybrid search (vector + keyword), feature-rich.
  • pgvector: PostgreSQL extension. If you're already on PG, just add an extension — no new component needed.

For personal projects or small teams, ChromaDB or pgvector is enough. Don't spin up a Milvus cluster for a side project.

I use ChromaDB for Hermes Agent — mainly because it's dead simple:

python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_docs")

# Add documents
collection.add(
    documents=["RAG stands for Retrieval-Augmented Generation", "Vector DBs store embeddings"],
    ids=["doc1", "doc2"]
)

# Query
results = collection.query(query_texts=["What is RAG?"], n_results=3)

That's it. No server config, no connection pools. Data lives in a local folder.

Retrieval Strategies: It's Not Just "Find the Nearest"

Basic retrieval is vector similarity search — find the chunks most "similar" to the question. But in practice, that alone isn't enough.

Problem 1: Semantically Similar ≠ Answer-Relevant

A user asks "how to set up Python virtual environments." Vector search might return chunks about "Python" and "environments" in general, but the specific chunk about venv or conda ranks lower.

Fix: Reranking

Use vector search for a coarse top-20, then a reranker model to reorder by actual relevance:

python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

# Coarse retrieval (vector_search is a placeholder for your own vector DB query;
# it should return a list of chunk texts)
candidates = vector_search(query, top_k=20)

# Fine ranking: score every (query, chunk) pair, then sort by score
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_results = [doc for doc, score in reranked[:5]]

Rerankers are slower than embedding models, but they look at the full interaction between query and document — much more precise. Coarse-then-fine is the sweet spot between speed and accuracy.

Problem 2: User's Wording Doesn't Match the Docs

User says "how to deploy," docs say "installation and configuration guide." Semantically close, but keyword mismatch.

Fix: Hybrid Search

Combine vector search with keyword search:

python
# Vector search results
vector_results = collection.query(query_texts=[query], n_results=10)

# Keyword search (if using pgvector + PostgreSQL, just use full-text search)
# db is a placeholder for your database connection, keyword for the extracted search term
keyword_results = db.execute(
    "SELECT * FROM docs WHERE content ILIKE %s", [f"%{keyword}%"]
)

# Merge and deduplicate using RRF (Reciprocal Rank Fusion)
# reciprocal_rank_fusion is a helper you supply, not a library function
final_results = reciprocal_rank_fusion(vector_results, keyword_results)

Many vector databases now have hybrid search built in — Qdrant and Weaviate both support it natively. No need to write the fusion logic yourself.
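If your database doesn't do the fusion for you, RRF itself is short. This is a sketch under my own assumptions: it takes plain lists of document IDs in rank order (so you'd pull the IDs out of each search result first), and it uses k=60, the constant commonly used with RRF. The document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ids = ["doc3", "doc1", "doc7"]    # hypothetical vector-search order
keyword_ids = ["doc1", "doc9", "doc3"]   # hypothetical keyword-search order
fused = reciprocal_rank_fusion(vector_ids, keyword_ids)
print(fused)
```

Documents that show up in both lists (doc1, doc3) float to the top, and duplicates collapse into one entry. RRF only looks at ranks, so you never have to reconcile vector distances with keyword scores.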

What's New in RAG for 2026

RAG has evolved a lot in the past two years. The basic "chunk-embed-retrieve-generate" pipeline is still the same, but there are new optimization layers on top.

Agentic RAG

This is the hottest direction in 2026. The core idea: let an AI Agent decide how to search, instead of hardcoding the retrieval flow.

Traditional RAG is fixed: receive question → search → generate answer.

Agentic RAG hands the "search" step to the Agent:

  • Does this question need to search the knowledge base at all?
  • Are the results good enough? If not, try different keywords
  • This question needs to be broken into sub-questions, searched separately, then combined
  • This question should query a SQL database, not vector search

Implementation with LangChain:

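python
# Written against LangChain's classic agents API; import paths have moved between releases
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import Tool

def search_knowledge_base(query: str) -> str:
    results = collection.query(query_texts=[query], n_results=3)
    return "\n".join(results["documents"][0])

tools = [
    Tool(name="SearchDocs", func=search_knowledge_base,
         description="Search internal documentation")
]

# llm is your chat model and prompt is a ReAct prompt template; both are defined elsewhere
agent = create_react_agent(llm, tools, prompt)
# create_react_agent only builds the reasoning chain; AgentExecutor runs the tool loop
executor = AgentExecutor(agent=agent, tools=tools)
response = executor.invoke({"input": "How do I deploy to production?"})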

The Agent decides on its own whether to search, how many times, and with what keywords. Much more flexible than a hardcoded pipeline.
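Stripped of any framework, the core loop is "search, judge, retry." Here's a rough sketch of that control flow, not LangChain's implementation. Every name in it is hypothetical, and the three callables stand in for what would be LLM calls and a real knowledge base.

```python
from typing import Callable


def agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    is_good_enough: Callable[[str, list[str]], bool],
    rewrite_query: Callable[[str, int], str],
    max_rounds: int = 3,
) -> tuple[list[str], list[str]]:
    """Search, judge the results, and retry with a rewritten query until satisfied.

    Returns the final results plus the list of queries that were tried.
    """
    query, tried, results = question, [], []
    for round_number in range(1, max_rounds + 1):
        tried.append(query)
        results = search(query)
        if is_good_enough(question, results):
            break
        query = rewrite_query(question, round_number)
    return results, tried


# Stand-ins for the knowledge base and the LLM, only here to make the sketch runnable
fake_index = {"installation and configuration guide": ["Run install.sh, then edit config.yaml"]}


def search(query: str) -> list[str]:
    return fake_index.get(query, [])


def is_good_enough(question: str, results: list[str]) -> bool:
    return len(results) > 0


def rewrite_query(question: str, round_number: int) -> str:
    return "installation and configuration guide"


results, tried = agentic_retrieve("how to deploy", search, is_good_enough, rewrite_query)
print(tried)
print(results)
```

The first query finds nothing, the rewritten one hits, and the loop stops. The `max_rounds` cap matters in practice: without it, an agent that never judges its results good enough will keep searching.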

Graph RAG

Microsoft's GraphRAG adds a knowledge graph layer on top of standard RAG.

Regular RAG retrieves "text chunks." Graph RAG retrieves "entities and relationships." If you ask "which project does Zhang San work on?", regular RAG might find a chunk mentioning "Zhang San" that doesn't include project info. Graph RAG can traverse the knowledge graph directly: "Zhang San → owns → Project X."

Works well for relationship-heavy scenarios like org charts, supply chains, and legal document cross-references.

The downside: building a knowledge graph is a pain. Not worth it for small projects.
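To see what "traversing" means at toy scale, here's a sketch with hand-written triples. It is not Microsoft's GraphRAG pipeline, just the lookup idea. Apart from "Zhang San owns Project X," the entities and relations are invented, and in a real system an LLM would extract the triples from your documents.

```python
from collections import defaultdict

# (subject, relation, object) triples
triples = [
    ("Zhang San", "owns", "Project X"),
    ("Project X", "depends_on", "Service Y"),
    ("Li Si", "owns", "Service Y"),
]

# Adjacency index: subject -> list of (relation, object)
graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))


def neighbors(entity: str, relation: str) -> list[str]:
    """Follow one relation type out of an entity."""
    return [obj for rel, obj in graph.get(entity, []) if rel == relation]


def two_hop(entity: str, first: str, second: str) -> list[str]:
    """Follow two relations in sequence, e.g. owns -> depends_on."""
    return [end for mid in neighbors(entity, first) for end in neighbors(mid, second)]


print(neighbors("Zhang San", "owns"))
print(two_hop("Zhang San", "owns", "depends_on"))
```

The two-hop query answers "what do Zhang San's projects depend on?", which is the kind of question chunk retrieval struggles with when the two facts live in different documents.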

Contextual Retrieval

Anthropic proposed a neat optimization: add context to each chunk during splitting.

Example: original text says "The product supports SSO." After chunking, this fragment loses the "which product?" context. Contextual Retrieval has the model prepend: "This is about Product X's features: The product supports SSO."

Chunks with context retrieve much more accurately. In Anthropic's tests, adding this context before embedding cut the top-20 retrieval failure rate by 35%.
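Mechanically, the technique is just "generate a context line, prepend it, then embed." Here's a minimal sketch. `contextualize_chunks` and `describe_with_title` are names I made up, and the describe function is a placeholder: a real version would ask an LLM to situate each chunk within the full document. The second sample chunk is invented.

```python
from typing import Callable


def contextualize_chunks(
    doc_title: str,
    chunks: list[str],
    describe: Callable[[str, str], str],
) -> list[str]:
    """Prepend a short context sentence to every chunk before embedding."""
    return [f"{describe(doc_title, chunk)} {chunk}" for chunk in chunks]


def describe_with_title(doc_title: str, chunk: str) -> str:
    """Placeholder for the LLM call; this version only uses the document title."""
    return f"This is from the document '{doc_title}':"


chunks = ["The product supports SSO.", "Pricing starts at the team tier."]
contextualized = contextualize_chunks("Product X feature overview", chunks, describe_with_title)
print(contextualized[0])
```

You embed the contextualized text, but you can keep the original chunk alongside it so the prompt and any citations show the unmodified wording.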

RAG vs Fine-tuning vs Long Context

In 2026, there are three paths to "make the model understand your data":

RAG: Don't modify the model, retrieve reference content in real-time. Good when data changes often, when you need source citations, or when the data volume is large.

Fine-tuning: Retrain parts of the model with your data. Good for changing model behavior (output style, specialized terminology). But expensive — data changes mean retraining.

Long context: Stuff all your data into the prompt. Models now support huge context windows — Claude handles 200K tokens, Gemini handles 1M. If your data isn't huge, just shoving it in might be simpler than building RAG.

How to choose:

  • Small data (under ~50K characters) → Long context, simplest
  • Large data, frequently updated → RAG
  • Need to change model behavior/style → Fine-tuning
  • Need source citations, explainability → RAG
  • Budget-conscious, don't want to maintain infra → Long context or API calls

Honestly, a lot of people build RAG because it sounds "advanced." But if your data volume is small, long context is probably simpler and more effective. Don't use tech for tech's sake.

RAG Evaluation: How to Know If It's Working

After building RAG, how do you know if it's any good? "I tried a few questions and it seemed fine" isn't enough.

Core Metrics

RAG quality has two dimensions:

Retrieval quality: Are the retrieved chunks relevant to the question? Metrics: Recall@K (what fraction of the relevant documents show up in the top-K results) and MRR (mean reciprocal rank: how high the first relevant result ranks, averaged over all test questions).

Generation quality: Is the model's answer good given the retrieved context? Metrics: accuracy (is the answer correct), completeness (does it miss key info), and faithfulness (does it make up stuff not in the retrieved context).

That last one — faithfulness — is critical. A common RAG problem is hallucination: the model confidently generates information that wasn't in the retrieved results. You need to specifically check for this.
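The two retrieval metrics are easy to compute by hand. In this sketch, Recall@K is the fraction of relevant documents that appear in the top K, and MRR averages 1/rank of the first relevant hit across questions. The document IDs and results are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical results for two test questions: (retrieved IDs in rank order, relevant IDs)
runs = [
    (["doc_4", "doc_0", "doc_9"], {"doc_0", "doc_2"}),
    (["doc_1", "doc_5", "doc_3"], {"doc_1"}),
]

mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(f"Recall@3 = {mean_recall:.2f}, MRR = {mrr:.2f}")
```

For these two questions both numbers come out to 0.75: the first question finds one of its two relevant docs, at rank 2, and the second finds its only relevant doc at rank 1.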

A Simple Evaluation Method

If you don't want a complex evaluation framework, try this:

  1. Prepare 20-30 test questions covering your knowledge base
  2. Manually annotate "correct answer" and "documents that should be retrieved"
  3. Run your RAG pipeline, compare results
  4. Calculate retrieval recall and answer accuracy

python
test_cases = [
    {
        "question": "How to create a Python virtual environment?",
        "expected_docs": ["doc_0"],
        "expected_answer_contains": ["venv", "python -m venv"]
    },
    # ...more test cases
]

hits = 0
for case in test_cases:
    results = collection.query(query_texts=[case["question"]], n_results=3)
    retrieved_ids = results["ids"][0]

    # Count a hit when at least one expected document is in the top 3
    hit = any(doc_id in retrieved_ids for doc_id in case["expected_docs"])
    hits += hit
    print(f"Q: {case['question'][:40]}... Retrieved hit: {hit}")

print(f"Hit rate: {hits / len(test_cases):.0%}")

For something more formal, check out RAGAS — a dedicated RAG evaluation framework that supports Faithfulness, Answer Relevancy, Context Recall, and other metrics.

Honestly, most people build RAG and never evaluate it. Then users complain, and they discover retrieval recall is only 40% — most questions can't find the right document. Spending half a day on evaluation saves a lot of headaches later.

Pitfalls I've Hit

Pitfall 1: chunk_size Too Large

I started with chunk_size=2000, thinking bigger chunks mean more complete context. Turns out each retrieved chunk was huge, and top_k=5 filled up the context window. The model couldn't find the important bits in all that noise.

Switched to 500 with reranking. Much better. Go small on chunks, use reranking for relevance, rather than big chunks full of noise.

Pitfall 2: No Metadata Filtering

My knowledge base had docs, issue records, and config files mixed together. When users asked "how to configure X," vector search returned both documentation and issue discussions — contradicting each other.

Added metadata filtering:

python
# Only works if each chunk was stored with metadata, e.g. metadatas=[{"type": "documentation"}]
results = collection.query(
    query_texts=["How to configure Redis"],
    where={"type": "documentation"},  # Only search docs, not issues
    n_results=5
)

Pitfall 3: Embedding Model / Query Language Mismatch

My docs were in Chinese, but users sometimes asked in English. Using a Chinese-only embedding model for English queries gave terrible results.

Fix: use a multilingual embedding model like BGE-M3 or Jina embeddings v3 — handles both Chinese and English well.

Pitfall 4: Forgetting About Document Updates

I only built indexing, not updating. Docs changed but the vector DB still had old versions. Answers referenced outdated info.

Added a simple hash-based update mechanism:

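python
import hashlib

def get_hash(text):
    return hashlib.md5(text.encode()).hexdigest()

# Check during update
# new_chunks maps chunk_id -> current text; stored_hashes maps chunk_id -> hash at last indexing
for chunk_id, chunk in new_chunks.items():
    chunk_hash = get_hash(chunk)
    if chunk_hash != stored_hashes.get(chunk_id):
        # upsert re-embeds the text and replaces the stored document in one call
        collection.upsert(ids=[chunk_id], documents=[chunk])
        stored_hashes[chunk_id] = chunk_hash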

Hands-On: Build a Minimal RAG in 20 Minutes

Enough theory. Here's a minimal working RAG example with ChromaDB + OpenAI:

python
# pip install chromadb openai

import chromadb
from openai import OpenAI

# Initialize
client = chromadb.PersistentClient(path="./my_rag_db")
collection = client.get_or_create_collection("docs")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Index documents
documents = [
    "Python virtual environments can be created with venv: python -m venv myenv",
    "Docker containers start with: docker run -d -p 80:80 nginx",
    "Git branches can be created with: git checkout -b feature-x",
    "SSH keys are generated with: ssh-keygen -t ed25519",
]

for i, doc in enumerate(documents):
    # upsert rather than add, so re-running the script doesn't trip over existing IDs
    collection.upsert(documents=[doc], ids=[f"doc_{i}"])

# Query
def ask(question):
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer based on reference material. If not in the material, say you don't know.\n\nReference:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# Test
print(ask("How do I create a Python virtual environment?"))

That's all the code. It runs and answers questions from the indexed snippets. Return results["ids"] alongside the answer and it can cite its sources too.

Production needs more: error handling, logging, update mechanisms, reranking, hybrid search. But this minimal version lets you understand the core RAG flow in 20 minutes.

Tool Recommendations

If you want to build a RAG app quickly without starting from scratch:

  • LangChain: Most popular RAG framework. Rich ecosystem, but APIs change frequently and docs sometimes lag behind code.
  • LlamaIndex: Built specifically for RAG. Narrower in scope than LangChain, with a more elegant design for data indexing and querying.
  • Dify: Open-source LLM platform. Drag-and-drop RAG workflow building, no coding required.
  • RAGFlow: Open-source, strong document parsing (tables, image OCR). Great for Chinese document handling.
  • FastGPT: User-friendly UI, good for quickly building knowledge base Q&A systems.

For adding doc Q&A to your own project, LlamaIndex is enough. For building a product for users, Dify or RAGFlow are better fits.

Final Thoughts

RAG isn't a silver bullet. It solves "let the model access your data," but it introduces new complexity: chunking strategy, embedding model selection, vector DB operations, retrieval quality tuning.

The good news in 2026: the tooling is mature. ChromaDB, LlamaIndex, Dify — these tools lower the barrier significantly.

But the core problems — how to chunk docs, how to ensure retrieval quality, how to handle data updates — you still have to tune for your specific use case. No one-size-fits-all config exists.

I'm planning to try Agentic RAG next — letting the Agent decide retrieval strategy instead of hardcoding it. I'll write that up when I get there.

Got questions? Drop them in the comments.

RAG in Practice: Building a Knowledge Q&A System That Actually Works — AI Hub