AI Coding Tools Eating Your Tokens? headroom Saves 60-95% on Context Costs
I've been using Claude Code daily for the past few months, and one thing keeps bugging me: token costs add up fast. Especially when the agent needs to read a whole codebase, scan logs, or pull from RAG — the context window fills up before the model even starts doing useful work.
Last week I spotted a project called headroom on GitHub Trending. It had gained 12,000+ stars in a single week, hitting 41,000+ total. The pitch: "compress everything your AI agent reads before it reaches the LLM — 60-95% fewer tokens, same answers." Sounded too good to be true. But the benchmark data looked legit, so I gave it a shot.
After spending an afternoon with it, I'm genuinely impressed. Here's what I found.
The Problem: Token Bloat in AI Agents
If you use AI coding tools regularly, you know the feeling. Every file the agent reads generates tool output. Every API call returns JSON. Every log dump adds thousands of tokens. Before you know it, a single task burns through 50,000+ tokens of input alone.
The cost is one thing. But there's a subtler issue: when the context gets too long, the model's attention分散. It's like asking someone to read 100 pages of documentation and then answer a specific question — they'll probably miss the key detail buried on page 47.
headroom solves this by compressing the context before it reaches the model. It strips out noise, deduplicates information, and preserves what actually matters.
What headroom Actually Does
headroom is an open-source context compression layer that sits between your AI agent and the LLM. It works in three modes:
- Library:
compress(messages)in Python or TypeScript, inline in any app - Proxy:
headroom proxy --port 8787— zero code changes, any language - MCP server: Install it as an MCP server for any MCP client
The architecture is straightforward. A ContentRouter detects the content type and routes it to the right compressor — SmartCrusher for JSON, CodeCompressor for code (AST-aware), and Kompress-base for general text (a model they trained on HuggingFace).
The key feature: compression is reversible. Original data stays in a local CCR (Cacheable Compressed Representation) cache. If the model needs the full original, it can call headroom_retrieve to get it back. This addresses the biggest concern people have about compression — "won't it lose information?"
Tool Compatibility
headroom has first-class support for the major AI coding tools:
- Claude Code:
headroom wrap claudewith--memoryand--code-graphoptions - OpenAI Codex:
headroom wrap codex, shares memory with Claude Code - Cursor:
headroom wrap cursor— prints config for you to paste once - Aider:
headroom wrap aider— starts proxy and launches - Copilot CLI:
headroom wrap copilotwith subscription mode support - OpenClaw: Installs as a ContextEngine plugin
For anything else using an OpenAI-compatible API, headroom proxy works as a drop-in. MCP clients just need headroom mcp install.
The Numbers
headroom's official benchmarks on real workloads:
- Code search (100 results): 17,765 → 1,408 tokens — 92% savings
- SRE incident debugging: 65,694 → 5,118 tokens — 92% savings
- GitHub issue triage: 54,174 → 14,761 tokens — 73% savings
- Codebase exploration: 78,502 → 41,254 tokens — 47% savings
That last one is only 47%. Not every scenario hits 90%+. If the content is already dense with no redundancy, there's less to compress. But most AI agent work — reading logs, scanning code, processing JSON — has plenty of noise.
Accuracy benchmarks:
- GSM8K (math): baseline 0.870, headroom 0.870 — no drop
- TruthfulQA (factual): baseline 0.530, headroom 0.560 — actually improved
- SQuAD v2 (QA): 97% accuracy with 19% compression
- BFCL (tool calling): 97% accuracy with 32% compression
The TruthfulQA improvement is interesting. My guess is that compressing out noise helps the model focus on the actual facts.
Getting Started
Installation is simple:
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
For proxy mode, just:
| 1 | |
This starts a local proxy that intercepts Claude Code's requests, compresses them, and forwards to Anthropic's API. The experience is identical to using Claude Code normally, but with significantly lower token costs.
To see actual savings:
| 1 | |
My Real-World Experience
I tested headroom on a code review task with a Next.js project (~200 files).
Without headroom: ~80,000 input tokens per session With headroom: ~35,000 input tokens
Over 50% savings. The code review quality was identical — headroom's AST-aware compressor preserves function signatures, type definitions, and control flow while stripping comments, blank lines, and redundant imports.
One thing to note: the first run is slower because Kompress-base needs to download (~200MB). After that it's cached, and compression takes milliseconds.
headroom is particularly effective on JSON tool output. One API response I tested went from 3,000+ tokens to 800. The model still understood all the user data — it just lost the metadata it didn't need.
How Reversible Compression Works
headroom's CCR mechanism deserves more explanation.
Normal compression is one-way. You go from 10,000 tokens to 2,000, and those 8,000 tokens of information are gone. CCR is different — it stores the originals locally and attaches a "retrieval handle" to the compressed version.
When the LLM processes compressed context and finds something unclear (like needing the full log instead of a summary), it calls headroom_retrieve to get the original back.
It's similar to RAG — give an overview first, dive deeper on demand. But headroom automates this. You don't need to design retrieval strategies manually.
Three Compression Engines Explained
headroom has three core compressors. Understanding them helps you estimate how much you'll save.
SmartCrusher: JSON Compression
JSON is the most common tool output format. API responses, config files, database results — all JSON.
SmartCrusher flattens nested objects, deduplicates arrays, removes null/default values, and truncates long strings. A typical API response with status/data/meta nesting and user objects full of email, timestamps, and other noise gets compressed to just the fields the model actually needs.
Compression rate: 60-70%.
CodeCompressor: AST-Aware Code Compression
Code compression is harder. You can't just delete "unimportant-looking" code — a comment might contain critical business logic, a blank line might be a visual separator.
CodeCompressor uses AST to understand code structure. It preserves function signatures, type definitions, control flow, while compressing implementation details and removing decorative comments.
I tested it on a Python module: 500 lines compressed to ~200. All function signatures and key logic intact, implementation details summarized.
Trade-off: if the model needs to modify a specific function's implementation, the compressed version might not be enough. That's where CCR's retrieve comes in.
Kompress-base: General Text Compression
This is headroom's own trained model on HuggingFace, designed for unstructured text — logs, docs, error messages. It does "intelligent summarization": keeping key lines from long logs, extracting core info from doc paragraphs, merging repeated patterns.
A 500-line debug log might compress to 10-20 key lines. The model doesn't need to see every [DEBUG] Processing item 42/1000 — it just needs to know something went wrong at item 42.
CacheAligner: The Unsung Hero
headroom has a component called CacheAligner that most people overlook, but it's crucial for saving money.
Both Anthropic and OpenAI offer prompt caching — if consecutive requests share the same prefix, subsequent requests reuse the cache at half price. But cache hits require the prefix to be identical.
The problem: with AI agents, every request adds new messages to the conversation history, so the prefix keeps changing and cache hit rates drop.
CacheAligner fixes this by putting the changing parts (new messages) at the end and keeping the stable parts (system prompt, tool descriptions, compressed history) at the front. This lets provider KV caches stay warm.
In testing, CacheAligner improved prompt caching hit rates from 30% to 70%+. That alone saves another 20-30% on input costs.
headroom vs Native Prompt Caching
People ask: doesn't Claude already have prompt caching? Why do I need headroom?
They solve different problems:
Prompt caching handles repeated content. If you send 10 consecutive requests with the same system prompt and tools, the shared parts cache after the first request at half price.
headroom handles content volume. Even if every request has different content (new conversation, new tool output), headroom compresses the size.
They're complementary. Use headroom to shrink content, then let prompt caching handle the stable prefix. Double savings.
My test with Claude Code exploring a codebase over 10 requests:
- Nothing: 85,000 input tokens
- Prompt caching only: ~50,000 tokens
- headroom only: ~35,000 tokens
- Both: ~20,000 tokens
76% total savings. At 3-4 hours of daily Claude Code usage, that's $100-200/month.
headroom learn: Learning from Failures
headroom has a feature called headroom learn that I think is really clever.
It analyzes your past failed AI agent sessions (where the model got confused, made wrong assumptions, or needed multiple retries) and extracts lessons, automatically writing them to CLAUDE.md or AGENTS.md.
Example: if Claude Code needed three attempts to modify a database schema — missing foreign keys the first time, forgetting migration files the second — headroom learn would add a note: "When modifying database schemas, check foreign key constraints and migration files first."
Next time Claude Code encounters a similar task, it reads this lesson automatically. The feature essentially creates experiential learning for your AI agent.
Image Compression Too
headroom recently added image compression. If your agent processes screenshots, UI mockups, or images in code, headroom uses ML to compress them by 40-90%.
It analyzes image content, preserving what helps the agent understand (text, key graphics, UI elements) while compressing the rest (backgrounds, decorative elements).
This is particularly useful for UI development with Claude Code — you can show it screenshots directly without spending thousands of tokens describing UI details.
Real Cost Breakdown
Let me do the math for my daily usage (3 hours of Claude Code, code review + feature work + debugging).
Without headroom:
- Input: ~80,000 tokens per session × $3/million = $0.24
- Output: ~30,000 tokens per session × $15/million = $0.45
- 3-4 sessions per day: $2-3/day
- Monthly: $60-90
With headroom:
- Input compressed 60%: 80K → 32K tokens
- Output compressed 30%: 30K → 21K tokens
- Daily: ~$0.8-1.2
- Monthly: $24-36
About 50% savings. More if you use expensive models (Opus) or use them more heavily.
headroom itself is open source and free. The only overhead is local CPU/memory — negligible in practice, milliseconds per compression.
Pitfalls
A few issues I ran into:
- First-run model download is slow. Kompress-base is ~200MB. After that it's cached.
headroom wrapmodifies environment variables. If your terminal has complex proxy settings, watch for conflicts.- Not everything compresses 90%+. The 92% figure is best-case (repetitive logs). Real-world 50-70% is more typical.
- MCP mode needs client support. Not all MCP clients call
headroom_retrievecorrectly. - Output compression changes reply style. Responses become more terse. If you're used to detailed explanations, it takes adjustment.
Comparison with Alternatives
Other token optimization approaches:
- Prompt caching (Anthropic/OpenAI native): Only caches repeated prefixes, doesn't compress content
- RAG: Manual retrieval strategy design, not general compression
- Manual context management: Writing your own filtering logic — tedious and hard to maintain
- Academic tools (LLMLingua, AutoCompressor): Research-stage implementations, not production-ready
headroom's advantage is being a one-stop solution. Any content type, any agent, any provider. Runs locally, supports Python and TypeScript SDKs, and has an MCP server mode.
Configuration and Tuning
Default settings work for most scenarios, but you can tune:
- Compression rate target: Set a ceiling via environment variable (e.g., "max 50% compression")
- Content type whitelist: Choose which types to compress — you might skip code but compress logs
- Cache TTL: How long CCR retains originals (default 24 hours)
Manage via headroom config or set programmatically.
What's Next
headroom is evolving fast. I plan to keep testing it across different scenarios, especially the cross-agent memory sharing feature. If you use Claude Code alongside other tools, this could be quite valuable.
The headroom learn feature is also worth deeper exploration. The idea of AI agents learning from their own mistakes is compelling.
Questions? Drop them in the comments.
- Written June 21, 2026. Check the GitHub repo for current version info. headroom is Apache 2.0 licensed and free for commercial use.*