$catMANUAL||~47 min

Context Engineering: The AI Agent Skill That Actually Matters More Than Prompt Engineering

advertisement

Context Engineering: The AI Agent Skill That Actually Matters More Than Prompt Engineering

I've been spending less time writing prompts lately. Not because prompts don't matter — they do — but because I realized the thing that actually determines whether my AI agent does a good job has almost nothing to do with the prompt itself.

You can write the most elegant system prompt in the world. But if the agent doesn't know your project structure, can't remember what happened three steps ago, or gets buried under a pile of useless tool output — it's going to mess up anyway.

That's what Context Engineering solves.

Karpathy recently defined it as "the delicate art and science of filling the context window with just the right information for the next step." The key word is "delicate." Most people just dump everything into the context window and call it done. Then they wonder why the model gets dumber over time. The problem isn't the model. It's your context strategy.

Context Engineering ≠ Prompt Engineering

Let me be clear: Context Engineering isn't just a rebrand of Prompt Engineering. They overlap, but they operate at different levels.

Prompt Engineering asks: how do I write one good instruction so the model gives me a good answer?

Context Engineering asks: at every step of an agent's execution, what should the model be looking at?

The difference is that prompts are static — you write them once. Context is dynamic — it changes as the agent works. Tool results come in, conversation history grows, the user's intent might shift. You have to actively manage what the model "sees" throughout the entire process.

I learned this the hard way. I built a feature for Hermes Agent that automatically writes articles and uploads them to a website. First version was simple: clear system prompt explaining the task, then let it run.

It failed spectacularly. Wrong format, wrong API calls, didn't even check existing articles before writing new ones. Everything I'd specified in the prompt was technically there — the agent just couldn't see it anymore by step 5. Tool output had flooded the context with thousands of lines of JSON. The original instructions were buried.

That's a context failure, not a prompt failure.

How Context Fails: Four Ways

Drew Breunig outlined four failure modes that I've found remarkably accurate:

Context Poisoning: A hallucination enters the context and poisons everything downstream. The agent makes a wrong judgment at one step, that wrong judgment gets written into context, and every subsequent step builds on the error.

I hit this when an agent misread a config value, then based all subsequent operations on that wrong value. The error sat in context the whole time, getting referenced again and again.

Context Distraction: The context gets so long that the model's attention is diluted. After dozens of agent steps, the context window is stuffed with tool outputs, intermediate results, and conversation history. The model starts missing the forest for the trees.

This is painfully obvious when using Claude Code for long coding sessions. After 30+ minutes, the context is nearly full and the model starts making mistakes it wouldn't have made earlier — like using an API you explicitly told it to avoid. It's not stupid. It literally can't see that instruction anymore.

Context Confusion: Irrelevant context messes up the model's judgment. Give an agent too many tools, each with its own description, and it starts picking the wrong ones.

This gets worse with MCP. Hook up a dozen MCP servers, each exposing several tools, and the model is staring at dozens of tool descriptions. Selection accuracy drops fast.

Context Clash: Different parts of the context contradict each other. The system prompt says "be concise" but the few-shot examples are all verbose. The model doesn't know which signal to follow.

Why Everyone's Talking About This in 2026

This isn't really sudden. The problem always existed — people just blamed the model. "GPT-4 is dumb." "Claude doesn't listen." In 2026, models got good enough that the same model performs brilliantly for some people and terribly for others. The difference is context strategy.

Cognition (the Devin company) said it straight: Context Engineering is the number one job of engineers building AI agents. Anthropic published a blog post saying agents often span hundreds of turns and need careful context management.

These aren't academic observations. Look at real products — Claude Code's auto-compact, Cursor's rules system, Windsurf's memories feature — they're all doing Context Engineering.

Karpathy himself shifted from Vibe Coding to Agentic Engineering. In his retrospective, he said LLM agents are now strong enough for real engineering work, but the前提 is you manage their context properly.

Think of it like managing a brilliant person with no memory. You can't expect them to figure everything out from scratch each time. You need to hand them the right information at the right moment.

The Four Core Strategies

LangChain categorizes Context Engineering into four strategies: Write, Select, Compress, and Isolate. Here's how each one works, with examples from tools I actually use.

Write: Save Information Outside the Context Window

The core idea is that not everything needs to live in the context window all the time. Some information can be stored elsewhere and pulled in when needed.

The most common form is a scratchpad. During task execution, the agent saves important intermediate results to a file or state object, then reads them back when needed later.

Anthropic's multi-agent research system does this. Their LeadResearcher writes the research plan to Memory before starting work, because the context window truncates at 200K tokens and the plan would be lost otherwise.

In Hermes Agent, the memory tool serves this purpose. The agent can persist key information — user preferences, project tech stack, past mistakes — and it gets auto-loaded at the start of each session. No need to re-explain everything every time.

Rules files are another Write pattern. Claude Code's CLAUDE.md, Cursor's .cursorrules, Copilot's .github/copilot-instructions.md — they all store project-level instructions in files that the agent reads on startup, instead of cramming everything into the system prompt.

Here's a quick comparison of the major approaches:

  • Claude Code's CLAUDE.md: Read automatically from project root and parent directories. Good for project-wide conventions and tech stack info. Supports per-directory files for module-specific notes.
  • Cursor's .cursorrules: More granular — supports global settings, project-level rules, and per-filetype rules. TypeScript files can have different rules than Python files.
  • Copilot's instructions: Works both in IDE and on GitHub.com for PR reviews and issue comments.
  • Hermes Agent's skills: Instead of one big file, instructions are split into independent skill files loaded dynamically based on the user's request. More modular but requires the agent to figure out which skills to load.

These all serve the same purpose: move project-level context out of the system prompt and into the filesystem, where agents can access it on demand.

Select: Pull In Exactly What You Need

Write solves storage. Select solves retrieval. The agent needs to pick the right information from a large pool of available context.

RAG is the most common Select pattern. You have a knowledge base, the user asks a question, you find the most relevant chunks and stuff them into the context window.

But Select goes beyond RAG. For agents, it also covers tool selection. When your agent has dozens of available tools, you can't dump all their descriptions into context. Research shows that applying semantic search to tool descriptions — similar to RAG — and only returning the most relevant tools improves selection accuracy by 3x.

Windsurf's CEO Varun put it well: indexing code isn't the same as context retrieval. As codebases grow, embedding search becomes unreliable. You need to combine grep, knowledge graphs, re-ranking, and other techniques.

I see this in Hermes Agent's skills system. Not all skills need to be loaded — if the user says "write me an article," only the article-writing skills get loaded, not the deployment skills or the gaming skills.

Compress: Keep Only What's Necessary

The longer an agent runs, the more context bloats. Compress is about periodically slimming it down.

The most direct approach is summarization. Claude Code has an auto-compact feature that kicks in at 95% context usage, summarizing the entire conversation history into a condensed version. This preserves key information while freeing up space.

But summarization has risks. You might lose important details. An agent made a critical decision at step 3 with specific reasoning — after compression, that reasoning is gone, and later steps don't know why that decision was made.

Cognition (Devin's company) fine-tuned a dedicated model just for summarization. That tells you how non-trivial this problem is — you can't just ask GPT to summarize and expect good results.

Another approach is trimming — rule-based filtering instead of LLM-based summarization. Delete messages older than 10 turns, keep only the last few tool call results, remove data that's been superseded by newer information.

In practice, I combine both: trimming for tool output (just keep the important fields), summarization for conversation history (let the LLM extract key points).

Isolate: Split Context Into Pieces

Isolate is my favorite strategy. Instead of managing everything in one massive context window, split context into multiple smaller, independent pieces.

The most common application is multi-agent architectures. Break a complex task into subtasks, assign each to a dedicated sub-agent. Each sub-agent has its own context window, its own tools, its own instructions. They don't share context — they only pass structured results between each other.

Anthropic's multi-agent research system does exactly this. They found that multiple sub-agents with isolated contexts outperformed a single agent with a packed context window, because each sub-agent could focus on a narrower task without distraction.

The tradeoff: Anthropic reported that multi-agent systems use up to 15x more tokens than regular chat. You have to weigh the quality improvement against the cost.

HuggingFace's deep researcher uses a different isolation approach — code sandboxes. The agent outputs code instead of direct tool calls. The code runs in a sandbox, and only the execution results (variable values) get passed back to the LLM. Large data objects stay in sandbox variables, never entering the context window.

This is clever. When a database query returns thousands of rows, you don't need to dump all that raw data into context. Store it in a variable, let the model access specific parts through code when needed.

Context Engineering in Practice: What I Learned from Hermes Agent

Hermes Agent is an open-source AI agent framework I've been using to build automation workflows. Here's what I learned about context management through real usage.

Skills as a Select Strategy

Hermes Agent's skills system is essentially a Select implementation. Each skill is a markdown file with specific task instructions. The agent doesn't load all skills — it looks at the user's request, decides which skills are relevant, and loads only those.

This sounds simple but solves a big problem. I tried putting all skill descriptions into the system prompt once. The agent kept getting confused — user wanted to write an article, but it would start running a deployment workflow instead.

Memory as a Write Strategy

Hermes Agent's memory system lets agents persist information across sessions. But here's the pitfall: injecting too much memory pollutes the context.

I found that if memory stores too much trivial stuff — "last time ran command X," "file was at /tmp/xxx" — it actually interferes with the agent's judgment. Now I have a rule: only store information useful across sessions. "User prefers concise responses," "project uses Next.js + Vercel," "API key is in .env." Temporary stuff doesn't get saved.

Sub-agents as an Isolate Strategy

The delegate_task tool is a textbook Isolate implementation. When a task is complex, the main agent can split subtasks to sub-agents. Each sub-agent has its own context, isolated from the main agent's conversation history.

I used this for batch article rewriting — distributing 10 articles across 3 sub-agents running in parallel. Each sub-agent only needed to focus on its assigned articles. Much better results than having one agent process everything sequentially.

But I also hit failures. Once I asked a sub-agent to upload an article to a website. It didn't know the API key was in an environment variable (the main agent's context had this info, but the sub-agent didn't). After that, I made sure critical information gets explicitly passed in the sub-agent's context parameter.

Common Pitfalls

Pitfall 1: Stuffing everything into context

Most common mistake. You write a 3,000-word system prompt, add 5 few-shot examples, throw in the project README. The agent gets dumber. Too much information dilutes attention. Like handing a new hire a 100-page onboarding manual — they don't know where to start.

Pitfall 2: Writing but never cleaning

Agents generate tons of intermediate data — tool outputs, temp calculations, error logs. If you don't clean up, the context window fills until it triggers truncation or auto-compact.

I ran an automation task on Hermes Agent that executed 80+ steps. Context overflowed, auto-compact kicked in, and lost critical configuration info. Everything after that was chaos.

Pitfall 3: Memory injection without filtering

Many agent frameworks auto-inject memories into context. ChatGPT had this issue — Simon Willison shared at the AI Engineer World's Fair that ChatGPT pulled his location from memory and unexpectedly injected it into an image generation request. Users felt the context window "no longer belonged to them."

Pitfall 4: Ignoring tool description quality

People spend ages optimizing prompts but ignore tool descriptions. For agents, tool descriptions are how it understands "what can I do?" A bad tool description means the agent doesn't know when to use it, uses wrong parameters, or skips it when it should use it. Keep tool descriptions focused: what it does, when to use it, what the parameters mean. 100-200 words per tool is about right.

Pitfall 5: Rules files that never get updated

CLAUDE.md or .cursorrules — people write them once and forget. But projects evolve. Tech stacks change, new modules get added, new pitfalls discovered. Outdated rules give the agent stale information.

My habit: every time I hit an agent-related pitfall, I write the lesson into CLAUDE.md. Next time the agent won't make the same mistake. That's the Write strategy in continuous iteration.

Practical Tips You Can Use Today

1. Figure out where your context is going

Before optimizing, know where your tokens are spent. If you're using LangChain/LangGraph, LangSmith can trace token usage per step. If not, at least log input tokens for each LLM call. You'll probably discover that tool outputs eat most of your budget, not your prompt.

2. Post-process tool returns

Easy win. Most tools return way more data than the model needs. A search API returns 10 results with title, URL, snippet, date, author, and a bunch of metadata. The model probably only needs title and snippet. Filter out the rest before injecting into context. Can save 30-50% of tokens.

3. Use rules files instead of long system prompts

If you're building code agents, use CLAUDE.md, .cursorrules, or similar files for project-level instructions. Benefits: can be split by directory, version-controlled, and loaded on demand.

4. Design your memory system carefully

Distinguish short-term memory (current session) from long-term memory (cross-session). Short-term uses context window or scratchpad. Long-term uses external storage with on-demand retrieval. Not everything deserves to be remembered — set a filter criterion.

5. Consider multi-agent splitting

If your agent's context regularly exceeds 50K tokens, consider splitting into sub-agents. Each sub-agent handles a narrower task with its own context window. But don't over-split — coordination between agents has costs too.

The Bottom Line

Context Engineering sounds fancy, but it's really just answering one question: at each moment of an agent's execution, what should the model see?

You don't need new frameworks or exotic techniques. You need to:

  • Understand what your agent is doing
  • Know what's in its context window
  • Figure out what's signal and what's noise
  • Give it the right information at the right time

This doesn't replace good prompting. You still need well-written prompts. But a good prompt alone isn't enough. You have to manage the entire lifecycle of context — writing, selecting, compressing, isolating.

Karpathy called it "the delicate art and science." I think that's right. It's not a formula. It's a way of thinking.

Next time your agent isn't performing well, don't rush to rewrite the prompt. First, look at what's actually in its context window. Is there stuff that shouldn't be there? Is there stuff that should be there but isn't?

The answer is almost always in the context.

I'm planning to write a follow-up on advanced Context Engineering — specifically how to implement memory systems and sub-agent context isolation in Hermes Agent. Drop a comment if there's something specific you want me to cover.

References

advertisement

Context Engineering: The AI Agent Skill That Actually Matters More Than Prompt Engineering — AI Hub