AI Agent Memory Systems: From Context Windows to Persistent Memory — Everything I Learned the Hard Way

Last week I hit a nasty bug while debugging Hermes Agent. After about 20 turns of conversation, the agent suddenly forgot something the user had said three minutes ago. Not a semantic misunderstanding — it completely blanked. Took me half a day to figure out the Context Window had overflowed and early conversation got silently truncated.

That sent me down a rabbit hole studying AI Agent memory systems. Spent about a week testing everything on the market, hit a bunch of walls, and picked up some hard-won lessons. Writing this up partly as a reference for myself, and partly for anyone else dealing with the same headaches.

What Context Window Actually Means

Let's start with something basic that most people think they understand but don't fully grasp.

The Context Window is the total amount of information a model can "see" in a single call. Every message you send, every tool call result, every system prompt token: it all eats into this window. GPT-4o's window is 128K tokens, Claude 3.5 Sonnet's is 200K, and Gemini 1.5 Pro claims up to 1M.

Sounds generous, right? It's not. Not even close.

Here's the math for a typical AI Agent conversation:

System prompt: roughly 2,000–5,000 tokens
Tool definitions (function schemas): 500–2,000 tokens per tool, so 10 tools = 5,000–20,000 tokens
Conversation history: 500–2,000 tokens per turn (user message + agent reply + tool call results)
Tool return data: this one's brutal — a single API call can return 3,000–10,000 tokens

So a "128K" window might only leave 50–80K tokens for actual conversation history. If your agent makes a few tool calls (searching the web, reading files, calling APIs), that window drains fast.
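To make that arithmetic concrete, here is a rough sketch of the budget calculation. The function name and the choice of four tool results are my own assumptions for illustration; the per-item figures are the upper ends of the ranges listed above.

```python
def remaining_history_budget(window, system_prompt, tool_schemas, tool_results, reply_reserve=0):
    """Tokens left for conversation history after the fixed overheads are paid."""
    used = system_prompt + tool_schemas + tool_results + reply_reserve
    return max(window - used, 0)


# Upper-end figures from the list above: 5K system prompt, 10 tools at 2K each.
# Four tool results at 10K each is an assumed workload, not a measurement.
budget = remaining_history_budget(
    window=128_000,
    system_prompt=5_000,
    tool_schemas=10 * 2_000,
    tool_results=4 * 10_000,
)
print(budget)  # 63000
```

Swap in your own measurements; the point is only that the fixed overheads come off the top before any history fits.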

I measured this on Hermes Agent: a complete code review task (read files + search + analyze + generate report) burned through 30–40K tokens. With a 128K window, you get maybe three or four rounds before hitting the wall.

Short-Term Memory: Sliding Windows and Summarization

The simplest fix is a sliding window. Conversation too long? Drop the early messages and keep only the last N turns.

python

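def sliding_window(messages, max_turns=20):
    """Brute-force sliding window: keep system messages plus the last N turns."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    if max_turns <= 0:
        return system  # history[-0:] would return the whole history
    # Assumes each turn is exactly one user + one assistant message;
    # tool messages would need turn-aware grouping instead.
    return system + history[-max_turns * 2:]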

Simple to implement, but the downside is obvious: important early information gets lost. The user mentioned in turn one that their project uses Python 3.11 + FastAPI, and twenty turns later that context is gone. The agent might start suggesting Flask. Awkward.

A step up from sliding windows is summarization. Every N turns, have the LLM compress the older conversation into a summary, then substitute the summary for the original messages.

python

def summarize_and_compress(messages, threshold=15):
    """When the conversation exceeds `threshold` turns, compress the older half.

    call_llm and format_messages are placeholders for your own LLM client
    and message formatter.
    """
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    if len(history) <= threshold * 2:
        return messages

    # Keep the split index even so a user/assistant pair is never cut in two
    split = len(history) // 4 * 2
    to_summarize = history[:split]
    summary_text = call_llm(
        f"Summarize this conversation, preserving key details:\n{format_messages(to_summarize)}"
    )

    # The summary is stored as a system message, so later passes keep it as-is
    summary_msg = {"role": "system", "content": f"Previous conversation summary: {summary_text}"}
    return system + [summary_msg] + history[split:]

Better than a sliding window, but two problems:

Each compression costs an extra LLM call — adds latency and cost
Compression loses detail. "The user's server is a 2-core 4GB Alibaba Cloud ECS" might get compressed to "the user has a cloud server." The specifics evaporate.

I tried a few variations:

Tiered compression: the 5 most recent turns kept verbatim, the 5 turns before those lightly compressed (strip verbose tool outputs), anything older heavily compressed (keep only key decisions)
Key information extraction: use the LLM to pull out key entities from each turn (file names, variable names, tech stack, decisions), stored as structured "memory cards"
Hybrid approach: summary + key info cards, both retained

The hybrid approach performed best in testing, but it's also the most complex to implement.
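For reference, the tiered compression variation can be sketched roughly like this. It is a simplified illustration, not Hermes Agent's actual code: the turn layout (dicts with `user`, `assistant`, and optional `tool_output` keys) and the injected `summarize` callable are assumptions, and in practice `summarize` would wrap an LLM call.

```python
def compress_tiered(turns, summarize, keep_verbatim=5, keep_light=5):
    """Apply three tiers to an oldest-first list of turn dicts.

    - newest `keep_verbatim` turns: untouched
    - the `keep_light` turns before those: tool output stripped
    - everything older: collapsed into one summary entry
    """
    n = len(turns)
    heavy_end = max(n - keep_verbatim - keep_light, 0)
    light_end = max(n - keep_verbatim, 0)

    compressed = []
    if heavy_end:
        compressed.append({"summary": summarize(turns[:heavy_end])})
    for turn in turns[heavy_end:light_end]:
        compressed.append({k: v for k, v in turn.items() if k != "tool_output"})
    compressed.extend(turns[light_end:])
    return compressed


# Stand-in summarizer so the sketch runs without an LLM
turns = [{"user": f"q{i}", "assistant": f"a{i}", "tool_output": "x" * 50} for i in range(12)]
result = compress_tiered(turns, summarize=lambda old: f"{len(old)} earlier turns")
print(len(result))  # 11
```

With 12 turns, the two oldest collapse into a single summary entry, the next five lose their tool output, and the last five pass through unchanged.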

Long-Term Memory: Vector DBs, Knowledge Graphs, and File Storage

Short-term memory handles "within a single conversation." The harder problem is memory across conversations.

When a user discusses deployment strategy today and comes back tomorrow, the agent should remember yesterday's discussion. This requires persistent long-term memory.

Three main approaches exist:

Approach 1: Vector Database

The most popular approach. Embed conversation content, user preferences, and key facts, then store them in a vector database (Pinecone, Qdrant, ChromaDB, etc.). At the start of each new conversation, retrieve relevant memories based on the current topic.

python

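# vector_db, get_embedding, and generate_id are placeholders for your own
# client and helpers; the exact query/result shape varies by vector database.
def store_memory(text, user_id):
    embedding = get_embedding(text)
    vector_db.upsert(
        vectors=[{"id": generate_id(), "values": embedding, "metadata": {"user_id": user_id, "text": text}}]
    )

def retrieve_memory(query, user_id, top_k=5):
    query_embedding = get_embedding(query)
    # Filter by user_id so one user's memories never show up in another's results
    results = vector_db.query(vector=query_embedding, top_k=top_k, filter={"user_id": user_id})
    return [r["metadata"]["text"] for r in results]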

Pros: easy to implement, fast retrieval, great ecosystem.

Cons: semantic retrieval isn't the same as precise retrieval. When the user asks "what port did I use for deployment last time," vector search might return "the user discussed deployment strategy" but not the actual port number. "Port number" and "deployment strategy" sit close together in embedding space, but an embedding doesn't reliably distinguish 3000 from 8080.

Running vector-based memory on Hermes Agent, I noticed something interesting: embeddings work well for "topic relevance" but poorly for "factual specifics." Ask "what tech options did we discuss before" and vector search nails it. Ask "did we pick PostgreSQL or MySQL" and it often can't answer.
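If you want to poke at the store/retrieve flow without standing up a real database, a toy in-memory version is enough. Everything here is an assumption for illustration: `ToyVectorStore` is a made-up class, and the bag-of-words `embed` function is only a stand-in for a real embedding model, so it won't reproduce the topic-versus-fact behavior described above.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self):
        self.rows = []  # list of (user_id, text, vector)

    def store(self, text, user_id):
        self.rows.append((user_id, text, embed(text)))

    def retrieve(self, query, user_id, top_k=5):
        query_vec = embed(query)
        # Only score rows that belong to this user, then rank by similarity
        scored = [(cosine(query_vec, vec), text) for uid, text, vec in self.rows if uid == user_id]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


store = ToyVectorStore()
store.store("we deploy the api on port 8080", user_id="u1")
store.store("user likes dark mode", user_id="u1")
store.store("deploy on port 3000", user_id="u2")
top = store.retrieve("which port do we deploy on", user_id="u1", top_k=1)
```

The per-user filter happens before ranking, which mirrors the `filter={"user_id": ...}` argument in the snippet above.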

Approach 2: Knowledge Graph

More structured than vector databases. Organize information as "entity-relationship-entity" triples.

For example:

(User) - [uses] -> (Python 3.11)
(User) - [deployed_on] -> (Alibaba Cloud ECS)
(Project) - [depends_on] -> (FastAPI 0.100+)
(Last deployment) - [used_port] -> (8080)

Zep uses this approach. It automatically extracts entities and relationships from conversations, building a temporal knowledge graph. Retrieval doesn't just find semantically similar content — it can traverse relationship chains to find connected information.

python

from zep_cloud.client import Zep
 
client = Zep(api_key="your-key")
 
client.memory.add(session_id="session-1", messages=[
    {"role": "user", "content": "My project uses PostgreSQL, deployed on Alibaba Cloud ECS, port 5432"},
])
 
memory = client.memory.get(session_id="session-1")
 
# Returns not just text snippets, but extracted entities and relationships

Pros: more accurate retrieval for factual information, handles complex relational queries.

Cons: high cost to build and maintain, entity extraction quality depends on LLM capability, slow cold start.
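To show what "traversing relationship chains" means in practice, here is a minimal in-memory triple store. It is a sketch of the general idea, not how Zep implements its graph; the `TripleStore` class and the extra `(User, owns, Project)` edge are assumptions added so there is a chain to follow.

```python
from collections import defaultdict


class TripleStore:
    def __init__(self):
        self.out_edges = defaultdict(list)  # subject -> [(relation, object)]

    def add(self, subject, relation, obj):
        self.out_edges[subject].append((relation, obj))

    def objects(self, subject, relation):
        """Exact lookup: all objects linked to subject by relation."""
        return [o for r, o in self.out_edges.get(subject, []) if r == relation]

    def reachable(self, start, max_hops=2):
        """Entities reachable from start within max_hops edges (breadth-first)."""
        seen = {start}
        frontier = [start]
        for _ in range(max_hops):
            next_frontier = []
            for node in frontier:
                for _, obj in self.out_edges.get(node, []):
                    if obj not in seen:
                        seen.add(obj)
                        next_frontier.append(obj)
            frontier = next_frontier
        return seen - {start}


graph = TripleStore()
graph.add("User", "uses", "Python 3.11")
graph.add("User", "deployed_on", "Alibaba Cloud ECS")
graph.add("User", "owns", "Project")  # assumed edge, not in the example above
graph.add("Project", "depends_on", "FastAPI 0.100+")
graph.add("Last deployment", "used_port", "8080")

print(graph.objects("Last deployment", "used_port"))  # ['8080']
```

The exact lookup answers the port question directly, and `reachable("User")` picks up FastAPI through the Project node even though no triple links User to FastAPI.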

Approach 3: File Storage + Structured Index

The simplest but sometimes most reliable approach. Store memories directly as files (JSON, Markdown, SQLite) with simple keyword or tag indexing.

Hermes Agent uses this approach. Each memory is a text record, categorized by target (user/memory), supporting add/replace/remove operations. At conversation start, all memories get injected into the system prompt.

python

memories = {
    "user": [
        "User is a full-stack developer, mainly uses Python and TypeScript",
        "User prefers concise response style",
        "User is in UTC+8 timezone"
    ],
    "memory": [
        "Project uses Next.js + Vercel deployment",
        "Database is Turso (SQLite edge database)",
        "Git repo at github.com/0311yet/aaiweb"
    ]
}

Pros: simple implementation, no external dependencies, precise retrieval (full injection), easy to debug.

Cons: as memory entries grow, they consume significant Context Window space. Currently Hermes Agent has about 60–70 memories, injecting 3,000–4,000 tokens. If entries grow to hundreds, this approach becomes impractical.
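A file-backed store with add/replace/remove and full injection fits in a few dozen lines. This is my own minimal sketch of the pattern, not Hermes Agent's actual storage code: the `FileMemory` class, the single-JSON-file layout, and the `render` format are all assumptions.

```python
import json
from pathlib import Path


class FileMemory:
    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.data = {"user": [], "memory": []}

    def _save(self):
        text = json.dumps(self.data, indent=2, ensure_ascii=False)
        self.path.write_text(text, encoding="utf-8")

    def add(self, target, text):
        if text not in self.data[target]:  # skip exact duplicates
            self.data[target].append(text)
            self._save()

    def replace(self, target, old, new):
        index = self.data[target].index(old)  # raises ValueError if old is absent
        self.data[target][index] = new
        self._save()

    def remove(self, target, text):
        self.data[target].remove(text)
        self._save()

    def render(self):
        """Flatten every memory into one block for the system prompt."""
        lines = []
        for target, entries in self.data.items():
            lines.append(f"[{target}]")
            lines.extend(f"- {entry}" for entry in entries)
        return "\n".join(lines)
```

Because the file is plain JSON, you can open it in an editor to audit or fix what the agent has stored.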

Framework Comparison: What's Actually Out There

I tested several memory frameworks seriously. Here's what I found.

Mem0

25K+ GitHub stars, YC S24 project — the most watched player in this space.

Core idea is a "Universal Memory Layer" that adds memory to any AI app. Three layers:

User Memory: cross-session preferences and history
Session Memory: current conversation context
Agent Memory: knowledge the agent itself has learned

Their April 2026 algorithm update uses "single-pass ADD extraction" — only adding new memories, never overwriting old ones. Combined with entity linking and multi-signal retrieval (semantic + BM25 keyword + entity matching), it scored 91.6 on the LoCoMo benchmark, a 20-point improvement.

The API is genuinely clean:

python

from mem0 import Memory
 
m = Memory()
m.add("I prefer Python for backend, TypeScript for frontend", user_id="user-1")
results = m.search("what languages does the user prefer", user_id="user-1")

But there's a catch: it defaults to OpenAI's embedding model. If you want to use local models or a different embedding provider, configuration gets messy. I spent a couple hours switching it to Jina embeddings.

Also, there's a significant feature gap between Mem0's hosted platform and the self-hosted open source version. Many advanced features (Entity Linking, Multi-signal Retrieval) are platform-only. The self-hosted version is more basic.

Letta (formerly MemGPT)

This project takes an interesting angle. Instead of bolting external memory onto an LLM, it has the LLM manage its own memory.

The core paper, "MemGPT: Towards LLMs as Operating Systems," likens the Context Window to "main memory" and external storage to "disk." The LLM reads and writes its own memory through function calls, much like a program using system calls to manage memory.

python

functions = [
    "core_memory_append(key, value)",
    "core_memory_replace(key, old, new)",
    "archival_memory_insert(content)",
    "archival_memory_search(query)",
    "conversation_search(query)",
]

Elegant design, but in practice I found a problem: LLMs aren't great at managing their own memory. They don't know when to store, when to query, when to update. In testing, the agent frequently stored trivial information in core memory (like "user said 'uh huh'") while missing important technical decisions.
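To see the shape of the idea, here is a toy version of the tool-dispatch loop. It borrows the function names from the list above but is emphatically not Letta's implementation: the `ToyAgentMemory` class, the substring-based archival search, and the `{"name", "arguments"}` call format are assumptions for illustration.

```python
class ToyAgentMemory:
    ALLOWED = {
        "core_memory_append",
        "core_memory_replace",
        "archival_memory_insert",
        "archival_memory_search",
    }

    def __init__(self):
        self.core = {}      # always placed in the prompt ("main memory")
        self.archival = []  # searched on demand ("disk")

    def core_memory_append(self, key, value):
        existing = self.core.get(key, "")
        self.core[key] = f"{existing}\n{value}".strip()

    def core_memory_replace(self, key, old, new):
        if old not in self.core.get(key, ""):
            raise ValueError(f"{old!r} not found in core memory block {key!r}")
        self.core[key] = self.core[key].replace(old, new)

    def archival_memory_insert(self, content):
        self.archival.append(content)

    def archival_memory_search(self, query):
        needle = query.lower()
        return [c for c in self.archival if needle in c.lower()]

    def dispatch(self, call):
        """Run one model-emitted call, e.g. {'name': ..., 'arguments': {...}}."""
        if call["name"] not in self.ALLOWED:
            raise ValueError(f"unknown memory function: {call['name']}")
        return getattr(self, call["name"])(**call["arguments"])


memory = ToyAgentMemory()
memory.dispatch({"name": "core_memory_append", "arguments": {"key": "human", "value": "Uses PostgreSQL"}})
memory.dispatch({"name": "core_memory_replace", "arguments": {"key": "human", "old": "PostgreSQL", "new": "MySQL"}})
print(memory.core["human"])  # Uses MySQL
```

The dispatch side is the easy part. The hard part, as noted above, is getting the model to emit the right calls at the right moments.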

The hosted version has a nice visual memory management interface where you can manually edit agent memories. But the open source version's documentation is sparse and the learning curve is steeper.

Zep

Zep positions itself as a "Context Engineering Platform" — not just memory, but full context engineering.

Its standout feature is the temporal knowledge graph. Instead of storing conversations as plain text, it automatically extracts entities and relationships, building a knowledge graph that evolves over time. For example:

Day 1: user says "I'm using PostgreSQL"
Day 3: user says "I switched my database to MySQL"
Zep records this transition, knowing the user's "current state" is MySQL, not PostgreSQL

This "temporal" capability is genuinely valuable in production. Most memory systems only know "the user mentioned PostgreSQL" — they don't know "the user later switched to MySQL." Zep's graph handles these state transitions.
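The core of the "current state" idea is just keeping every observation with a timestamp and answering from the latest one. The sketch below illustrates that idea only; it is not Zep's data model, and the `TemporalFacts` class and the calendar dates are assumptions.

```python
from datetime import date


class TemporalFacts:
    def __init__(self):
        self.history = {}  # (subject, attribute) -> [(observed_on, value)]

    def observe(self, subject, attribute, value, observed_on):
        self.history.setdefault((subject, attribute), []).append((observed_on, value))

    def as_of(self, subject, attribute, when):
        """Value that was current on a given date, or None if nothing was known yet."""
        entries = [e for e in self.history.get((subject, attribute), []) if e[0] <= when]
        return max(entries, key=lambda e: e[0])[1] if entries else None

    def current(self, subject, attribute):
        """Most recently observed value, or None if never observed."""
        entries = self.history.get((subject, attribute), [])
        return max(entries, key=lambda e: e[0])[1] if entries else None


facts = TemporalFacts()
facts.observe("user", "database", "PostgreSQL", date(2026, 6, 1))  # "Day 1"
facts.observe("user", "database", "MySQL", date(2026, 6, 3))       # "Day 3"

print(facts.current("user", "database"))                   # MySQL
print(facts.as_of("user", "database", date(2026, 6, 2)))   # PostgreSQL
```

Nothing is overwritten, so the system can answer both "what does the user run now" and "what were they running when we discussed X."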

They claim sub-200ms latency with SOC 2 Type II and HIPAA compliance. Clearly targeting enterprise use cases.

One concern: the open source version has limited functionality. The core Graph RAG and Context Assembly logic lives mainly in the cloud platform. Self-hosted experience takes a significant hit.

Others Worth Watching

Cognee: open source memory framework with multiple vector DB backends, solid GraphRAG implementation
MemoryScope (from Alibaba's ModelScope team): memory management optimized for Chinese-language scenarios, better Chinese entity extraction
LangChain Memory: LangChain's built-in memory module — the most options but also the most confusing (Buffer, Summary, VectorStore, Combined, etc.). Makes sense if you're already in the LangChain ecosystem.

What It Actually Costs: Running the Numbers

Many people treat memory systems as a "just add it" feature without considering ongoing costs. Let me run the actual numbers.

Say you have an AI Agent with 50 interactions per day, each producing 3 memories on average.

Embedding cost: each storage call needs an embedding API call. With OpenAI text-embedding-3-small at $0.02/million tokens and memories averaging 100 tokens each, that's 150 memories per day, or 4,500 per month. Total embedding consumption comes to around 450K tokens, which costs less than $0.01. Basically free.

Vector database cost: Pinecone's free tier supports 1M vectors. Qdrant and ChromaDB can be self-hosted. Small scale (under 100K entries) costs nothing.

Retrieval is where it gets real: each conversation start triggers a memory retrieval. Mem0's hosted service: free tier = 1,000 calls/month, Pro = $49/month. Self-hosted, each retrieval's LLM call costs roughly $0.001–0.005 depending on the reranking model.

Summarization is the most expensive part: compressing every 15 turns costs 2,000–5,000 tokens per call. With GPT-4o, that's $0.01–0.02 per compression. Do it 3–5 times daily, and you're at $1–3/month.
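The embedding and summarization figures above are easy to re-derive, which helps when you plug in your own volumes. The constants are the ones quoted in this section; the 30-day month and the variable names are my assumptions.

```python
DAYS_PER_MONTH = 30                # assumed month length
MEMORIES_PER_DAY = 50 * 3          # 50 interactions, 3 memories each
TOKENS_PER_MEMORY = 100
EMBED_PRICE_PER_MILLION = 0.02     # USD, text-embedding-3-small

monthly_tokens = MEMORIES_PER_DAY * DAYS_PER_MONTH * TOKENS_PER_MEMORY
embedding_cost = monthly_tokens / 1_000_000 * EMBED_PRICE_PER_MILLION


def monthly_summarization_cost(cost_per_call, calls_per_day, days=DAYS_PER_MONTH):
    """Monthly spend on compression calls at a flat per-call cost."""
    return cost_per_call * calls_per_day * days


summarize_low = monthly_summarization_cost(0.01, 3)
summarize_high = monthly_summarization_cost(0.02, 5)

print(f"embedding tokens per month: {monthly_tokens:,}")
print(f"embedding cost per month:   ${embedding_cost:.3f}")
print(f"summarization per month:    ${summarize_low:.2f} to ${summarize_high:.2f}")
```

The low end of summarization works out to about $0.90, which I round to $1 above.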

Total monthly cost for a moderately used agent's memory system: roughly $5–15. Not bad, but it scales linearly with user count. If you're building a SaaS serving hundreds of users, costs add up fast.

Ways to save:

Use local embedding models (Jina embeddings v3 or BGE-M3) — embedding cost drops to zero
Self-host an open source vector database (Qdrant or ChromaDB) — no platform fees
Use small models for initial retrieval screening, large models for final ranking — fewer expensive calls
Use cheaper models for summarization (GPT-4o-mini or Claude Haiku) — comparable quality at 1/10 the price

My Real-World Experience: Hermes Agent's Memory System

Enough framework talk — let me share what I actually deal with.

Hermes Agent's memory system is the most barebones approach: file storage + full injection. Each memory is a text record stored in ~/.hermes/memories/, split into user and memory targets. Every conversation starts by injecting all memories into the system prompt.

The good: simple, reliable, easy to debug. I can open the file and see exactly what the agent remembers. I can manually edit it too.

The bad:

Problem 1: Memory Drift

The agent sometimes stores incorrect information. Once it recorded "user prefers short replies" as "user doesn't like replies." One word off, completely different meaning. Worse, once stored, this memory loads every conversation and distorts agent behavior.

Fix: regularly audit and clean memories manually. I've gotten into the habit of reviewing the memory file weekly, deleting stale or incorrect entries.

Problem 2: Memory Conflicts

Two contradictory memories coexist:

"User prefers Python 3.11"
"User recently switched to Python 3.12"

The agent doesn't know which to trust. The replace operation exists for this — overwrite old with new — but the agent doesn't always detect conflicts. Sometimes contradictory memories coexist for weeks.

Problem 3: Memory Bloat

Memory entries grow over time. My Hermes Agent currently has about 60–70 entries, injecting 3,000–4,000 tokens. Still manageable, but if it grows to hundreds of entries, it'll eat into usable Context Window space.

Future plan: migrate low-frequency memories (unused for over a month) to a vector database, retrieving them only when needed. High-frequency memories stay fully injected.
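The split itself is simple once each memory carries a last-used timestamp. Here is a rough sketch of what I have in mind; the `last_used` field and the `split_hot_cold` helper are hypothetical, since the current memory records don't track usage yet.

```python
from datetime import datetime, timedelta


def split_hot_cold(memories, now, max_idle=timedelta(days=30)):
    """Partition memories into (hot, cold) by time since last use.

    hot  -> keep injecting into the system prompt
    cold -> candidates for moving to a vector database
    """
    hot, cold = [], []
    for memory in memories:
        if now - memory["last_used"] > max_idle:
            cold.append(memory)
        else:
            hot.append(memory)
    return hot, cold


now = datetime(2026, 6, 13)
memories = [
    {"text": "User prefers concise response style", "last_used": datetime(2026, 6, 10)},
    {"text": "Project uses Next.js + Vercel deployment", "last_used": datetime(2026, 4, 2)},
]
hot, cold = split_hot_cold(memories, now)
print(len(hot), len(cold))  # 1 1
```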

Pitfalls Nobody Talks About

A few issues I encountered in production that I rarely see discussed online.

Pitfall 1: Memory Security

Agent memories can contain sensitive information. Users tell agents their API keys, server passwords, personal details — all stored in memory files or vector databases. If the agent is shared among multiple users, or if memory storage gets compromised, the consequences are serious.

Most memory frameworks handle this crudely. Mem0's platform has basic user isolation, but self-hosted versions require you to handle it yourself.
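One cheap mitigation is to scrub obvious secrets before anything is written to memory. The patterns below are illustrative assumptions, not a complete secret scanner: they catch a few common shapes (an `sk-` style API key, an AWS access key ID, a `password: ...` pair) and will miss plenty of others.

```python
import re

# Illustrative patterns only; real deployments need a proper secret scanner.
SECRET_PATTERNS = [
    (re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"\b(password|passwd|pwd)(\s*[:=]\s*)\S+", re.IGNORECASE), r"\1\2[REDACTED]"),
]


def redact(text):
    """Replace anything matching a known secret pattern before storing the text."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("my key is sk-abcdefghijklmnop1234"))  # my key is [REDACTED_API_KEY]
print(redact("server password: hunter2 for now"))   # server password: [REDACTED] for now
```

Redacting at write time means a leaked memory file or vector index exposes less, whatever isolation the framework does or doesn't provide.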

Pitfall 2: No Forgetting Mechanism

Humans naturally forget unimportant information. AI agents don't. A memory stored three months ago, even if completely outdated, gets injected verbatim into the system prompt. This can cause the agent to make decisions based on stale information.

Good memory systems should have TTL (Time To Live) or decay mechanisms. Memories that haven't been retrieved in a long time should carry less weight. But most frameworks don't build this in.
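A decay mechanism doesn't need to be elaborate. As a minimal sketch, assuming each memory tracks how many days it has gone without being retrieved (the `idle_days` field, the 30-day half-life, and the 0.1 cutoff are all arbitrary choices for illustration):

```python
def decayed_weight(idle_days, half_life_days=30.0):
    """Exponential decay: the weight halves every half_life_days without a retrieval."""
    return 0.5 ** (idle_days / half_life_days)


def drop_expired(memories, min_weight=0.1, half_life_days=30.0):
    """Keep only memories whose decayed weight is still at least min_weight."""
    return [
        m for m in memories
        if decayed_weight(m["idle_days"], half_life_days) >= min_weight
    ]


memories = [
    {"text": "User is in UTC+8 timezone", "idle_days": 3},
    {"text": "User was debugging a cron job", "idle_days": 120},
]
print(decayed_weight(60))  # 0.25
print([m["text"] for m in drop_expired(memories)])  # ['User is in UTC+8 timezone']
```

The same weight can also be used as a ranking multiplier at retrieval time instead of a hard cutoff, so stale memories fade rather than vanish.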

Pitfall 3: Multi-Agent Memory Sharing

If your system has multiple agents (one for coding, one for testing, one for deployment), how do they share memories?

Full sharing means each agent's Context Window gets bloated by other agents' memories. Full isolation means agents can't collaborate.

The practical approach is layered memory:

Global memory: shared by all agents (project info, user preferences, tech stack)
Role memory: agent-specific (task history, learned experiences)
Task memory: shared within current task, archived after completion
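The three layers above map onto a small data structure. This is a sketch under my own assumptions (the `LayeredMemory` class, the role names, and the task IDs are hypothetical), meant to show how each agent's context gets assembled without pulling in other agents' role memory.

```python
class LayeredMemory:
    def __init__(self):
        self.global_memory = []  # shared by every agent
        self.role_memory = {}    # role -> entries visible to that role only
        self.task_memory = {}    # task_id -> entries shared within the task
        self.archive = {}        # task_id -> entries from completed tasks

    def remember_global(self, text):
        self.global_memory.append(text)

    def remember_role(self, role, text):
        self.role_memory.setdefault(role, []).append(text)

    def remember_task(self, task_id, text):
        self.task_memory.setdefault(task_id, []).append(text)

    def context_for(self, role, task_id=None):
        """Everything one agent should see: global + its own role + the active task."""
        return (
            self.global_memory
            + self.role_memory.get(role, [])
            + self.task_memory.get(task_id, [])
        )

    def complete_task(self, task_id):
        """Move a finished task's memories out of the active set."""
        self.archive[task_id] = self.task_memory.pop(task_id, [])


mem = LayeredMemory()
mem.remember_global("Project uses Next.js + Vercel deployment")
mem.remember_role("coder", "Prefer small, focused commits")
mem.remember_role("tester", "Run the e2e suite before sign-off")
mem.remember_task("task-42", "Login page returns 500 on empty password")

coder_context = mem.context_for("coder", task_id="task-42")
```

The coder sees the project info, its own habits, and the active bug, but never the tester's notes. Once the task is completed, its entries stop being injected for anyone.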

How to Choose: A Decision Framework

After all this testing, I think the choice comes down to three dimensions.

First, what granularity of memory do you need?

Within-conversation only → sliding window + summarization is enough
Cross-session user preferences → vector DB + simple key-value store
Precise factual retrieval (port numbers, version numbers, config values) → knowledge graph or structured storage
State transitions (user switching from plan A to plan B) → temporal knowledge graph

Second, what's your budget?

Memory systems aren't free. Rough estimates:

Embedding: ~$0.02/million tokens (OpenAI text-embedding-3-small)
Vector DB: Pinecone free tier handles 1M vectors
Mem0 platform: free = 1,000 calls/month, Pro = $49/month
Budget option: self-hosted ChromaDB/Qdrant + local embedding models (Jina v3)

Third, what's your tech stack?

Already using LangChain → use LangChain Memory modules
Enterprise compliance needed → Zep Cloud
Cleanest API → Mem0
Agent self-manages memory → Letta
Full control → DIY (file + vector DB + simple retrieval logic)

What's Next

AI Agent memory is in a "many options, none mature" phase. No single framework solves everything, and most require customization for production use.

My take: short-term, a hybrid of "file storage + vector retrieval + summarization compression" is the most practical. Knowledge graphs sound great but the build and maintenance cost is high — better suited for enterprise teams with dedicated resources.

Next I'm planning two things for Hermes Agent: integrate Mem0 for long-term memory (replacing the current pure-file approach), and add a memory TTL mechanism to auto-clean stale entries. Will write up the results when I'm done.

If you're working on AI Agent memory systems, I'd love to hear what's working for you. The frameworks you've tried, the pitfalls you've hit, the approaches that actually hold up in production.

Written June 13, 2026, based on the author's hands-on experience with Hermes Agent, Mem0, Letta, and Zep. Technical details may change as frameworks evolve, so check each project's latest docs.
