Context engineering for Claude: improving token efficiency.

Token bills are the new cloud bills. If you run an agentic coding tool all day, you've probably watched the number climb and wondered where it all went. A lot of it goes to noise: the agent runs go test and dumps four thousand tokens of passing-test output into the context, re-reads the same 600-line file for the third time, then pastes another wall of git status output. None of that is the work. It's the exhaust.

Most people reach for shorter prompts first. Wrong lever. The tokens aren't in your wording, they're in the exhaust, which makes token efficiency a context-engineering problem more than a writing one.

I've been running two tools to cut that exhaust for a while now: rtk and context-mode. You know how I love small, low-effort changes that deliver outsized gains. Same idea as shrinking a Redis cache by ~90% with zstd, except here the bytes you're saving are tokens. You install a thing, you adjust a setting or two, and the bill drops. This is one of those.

A note on scope. Everything here is about the CLI version of Claude Code. All of these tools work with Gemini CLI and Codex too, but Claude Code is where I tested, so that's what I'll talk about.

It's not just the bill.

The obvious reason to trim tokens is cost. The less obvious reason is quality.

Long, junk-filled context doesn't just sit there harmlessly until you hit the window limit. It actively drags the model down. Accuracy slides as the input grows, well before the window is full, and models are especially bad at using information buried in the middle of a long context. You feel it in practice: an agent that's read fifty files and three log dumps gets vague in a way it wasn't twenty minutes earlier.

So trimming the exhaust does two things at once. It saves money, and it keeps the model sharper for longer. That's a rare combination, and it's why I bother.

rtk: compress command output before it hits the model.

rtk is a CLI proxy. It sits between the agent's shell commands and the model, filters out the noise, and hands back a compact version. It's a single Rust binary, so installation is boring in the good way:

brew install rtk
rtk init -g       # installs the hook and an RTK.md, for Claude Code

Then restart Claude Code. That init step is the "adjust some Claude settings" part. It drops a PreToolUse hook that quietly rewrites your agent's Bash commands. When Claude runs git status, the hook turns it into rtk git status before it executes, and the agent gets the trimmed output without knowing anything changed.
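To make the rewrite step concrete, here's a rough Python sketch of what a hook like that has to decide. This is not rtk's code. It assumes the hook receives a payload with a tool_name and a tool_input.command, and the PROXIED allowlist is a hypothetical stand-in for whatever commands the real filter supports.

```python
import shlex

# Hypothetical allowlist: the commands this sketch knows how to proxy.
PROXIED = {"git", "go", "cargo", "npm"}


def rewrite_tool_call(payload: dict) -> dict:
    """Return the payload with a Bash command prefixed by `rtk`, or unchanged."""
    if payload.get("tool_name") != "Bash":
        return payload  # non-Bash tools pass through untouched

    command = payload.get("tool_input", {}).get("command", "")
    try:
        parts = shlex.split(command)
    except ValueError:
        return payload  # unbalanced quotes: safer to leave the command alone

    if not parts or parts[0] not in PROXIED:
        return payload  # unknown (or already proxied) command: do nothing

    new_input = dict(payload["tool_input"], command="rtk " + command)
    return dict(payload, tool_input=new_input)
```

The design choice worth copying is the default: when in doubt, the function returns the payload untouched, so a parsing hiccup never blocks the agent's command.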

Under the hood it leans on four ideas: strip the boilerplate, group similar items together, truncate the parts that don't carry signal, and collapse repeated lines into counts. The effect on something chatty like a push is dramatic:

# git push, raw
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 8 threads
... about a dozen more lines ...

# rtk git push
ok main

The agent didn't need any of the enumeration chatter. It needed to know the push worked and which branch it landed on.
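If you want a feel for the four ideas, here's a minimal Python sketch of each. It is my own illustration, not how rtk implements them: the BOILERPLATE patterns, the function names, and the "STATUS path" line format are all assumptions.

```python
import re
from itertools import groupby

# Hypothetical boilerplate patterns; a real filter would carry these per command.
BOILERPLATE = re.compile(
    r"^(Enumerating|Counting|Compressing|Writing) objects|^Delta compression"
)


def compact(lines: list[str], max_lines: int = 20) -> list[str]:
    # 1. Strip: drop lines that match known boilerplate.
    kept = [ln for ln in lines if not BOILERPLATE.search(ln)]

    # 2. Collapse: turn runs of identical consecutive lines into a count.
    collapsed = []
    for line, run in groupby(kept):
        n = len(list(run))
        collapsed.append(f"{line} (x{n})" if n > 1 else line)

    # 3. Truncate: keep the head and say how much was cut.
    if len(collapsed) > max_lines:
        hidden = len(collapsed) - max_lines
        collapsed = collapsed[:max_lines] + [f"... {hidden} more lines"]
    return collapsed


def group_by_status(lines: list[str]) -> list[str]:
    # 4. Group: fold "STATUS path" lines into one line per status.
    groups: dict[str, list[str]] = {}
    for ln in lines:
        status, _, path = ln.strip().partition(" ")
        groups.setdefault(status, []).append(path.strip())
    return [f"{status}: {', '.join(paths)}" for status, paths in groups.items()]
```

Note that the truncation step reports how many lines it hid. A filter that cuts silently is exactly the kind that makes the agent quietly dumber.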

In my own usage, rtk cut well over 80% off the tokens spent on CLI command output. I haven't hit a problem caused by it so far, but that number comes with an asterisk, and the asterisk gets a section of its own below.

One thing to know about scope.

The hook only fires on Bash tool calls. Claude Code's built-in Read, Grep, and Glob tools don't pass through it, so they aren't compressed automatically. If you want rtk's trimming there too, use shell commands (cat, rg, find) or call rtk read and rtk grep directly. Worth knowing so you're not confused when a big file read shows up at full size.

Where I hold rtk back.

There's a good study floating around where someone benchmarked these optimizers properly, and the conclusion stuck with me: they can be silently dangerous. The tool decides what counts as "junk," and one bad cut makes your agent quietly dumber without you ever seeing why. Nothing errors out. The answers just get a little worse, and you have no idea the filter ate the reason.

That risk isn't the same for every command, and this is where you get to be smart about it.

For something like git status, git log, or a generic log dump, the noise-to-signal ratio is terrible and the cost of a mistake is low. If the filter trims a git log a bit too aggressively, the agent asks again or moves on. Fine. So I let rtk handle those freely.

Test runners and linters feel like the opposite, and this is where it gets interesting. When a test fails, the exact assertion and the line number are the whole point. When a linter flags a rule on a specific file, that detail is what the agent is supposed to act on. Those are exactly the things an aggressive filter is tempted to drop, because to a compressor they read like repetitive structured noise. So the obvious move is to exclude tests outright. I didn't, and I'm glad I didn't.

The thing that saves you is the pass/fail asymmetry. A passing test run is pure ceremony, hundreds of lines of "ok" the model learns nothing from. A failing run is all signal. So instead of excluding tests, I lean on rtk's tee: passing runs collapse to a one-liner, and the moment something fails, rtk writes the full unfiltered output to disk so the failure detail survives intact and the agent reads it without re-running anything.

You'll find the config file location and its full structure in the rtk docs. The relevant snippet for the tee setting looks like this:

[tee]
enabled = true
mode = "failures"   # keep the full output on disk whenever something breaks

[hooks]
# the escape hatch: only list a command here if you catch rtk
# dropping something you actually needed
exclude_commands = []

That tee line is the real guard, not the exclude list. It's what lets me compress the noisiest commands I run without losing the one output that ever matters. If I catch rtk eating a detail I needed, the command goes in exclude_commands, but so far I haven't had to put anything there.
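The pass/fail asymmetry is easy to sketch. The Python below is my own illustration of the idea, not rtk's implementation: summarize_run, its arguments, and the log file naming are all hypothetical.

```python
import tempfile
from pathlib import Path


def summarize_run(command: str, exit_code: int, output: str, tee_dir: Path) -> str:
    """Collapse a passing run to one line; keep a failing run in full on disk."""
    lines = output.splitlines()
    if exit_code == 0:
        # Passing run: pure ceremony, so only the verdict survives.
        return f"ok: {command} ({len(lines)} lines hidden)"

    # Failing run: write the unfiltered output to disk before trimming anything.
    tee_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(
        "w", dir=tee_dir, prefix="run-", suffix=".log",
        delete=False, encoding="utf-8",
    ) as f:
        f.write(output)
        log_path = f.name

    # Hand back the tail plus a pointer to the full log.
    tail = "\n".join(lines[-10:])
    return f"FAILED: {command} (exit {exit_code})\nfull output: {log_path}\n{tail}"
```

The order matters: the full output hits the disk before any trimming happens, so even a bad trim can't lose the assertion and line number the agent needs.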

context-mode: keep raw tool output out of context entirely.

rtk handles shell output. context-mode goes after the other big offender: MCP tools.

Every MCP call tends to dump its raw payload straight into the conversation: a browser snapshot, twenty fetched issues, a fat access log. It piles up fast and you lose a chunk of your window to data you looked at once. context-mode is an MCP server that keeps that raw data out of the context window. It runs the work in a sandbox, stores the requests and responses in a local SQLite database, and surfaces only the slice that's actually relevant.
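Here's the offloading idea in miniature, as a Python sketch under my own assumptions. The payloads table, the offload function, and the plain substring match are hypothetical, and context-mode's real storage and retrieval are certainly more involved.

```python
import sqlite3


def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payloads ("
        "id INTEGER PRIMARY KEY, tool TEXT NOT NULL, body TEXT NOT NULL)"
    )


def offload(conn: sqlite3.Connection, tool: str, body: str,
            query: str, max_hits: int = 5) -> str:
    """Store the raw payload locally and return only the lines matching `query`."""
    with conn:  # commits on success
        cur = conn.execute(
            "INSERT INTO payloads (tool, body) VALUES (?, ?)", (tool, body)
        )
    hits = [ln for ln in body.splitlines() if query.lower() in ln.lower()]
    hits = hits[:max_hits]
    header = (
        f"[payload {cur.lastrowid} stored: {len(body)} chars, "
        f"{len(hits)} matching lines shown]"
    )
    return "\n".join([header, *hits])
```

Only the header and the matching lines would reach the conversation. The full body stays in the database, where it can be queried again without paying for it twice.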

The mental model it pushes is "think in code." Instead of reading fifty files into context to count something, the agent writes a small script that does the counting and prints just the answer. One script replaces ten tool calls, and only the result lands in the chat.
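A tiny example of what "think in code" looks like in practice. This is a sketch of the kind of throwaway script an agent might write, with a hypothetical function name and a made-up question (how many Go files mention TODO).

```python
from pathlib import Path


def count_files_containing(root: Path, needle: str, pattern: str = "*.go") -> int:
    """Count files under `root` matching `pattern` whose text contains `needle`."""
    count = 0
    for path in root.rglob(pattern):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the count
        if needle in text:
            count += 1
    return count
```

Calling print(count_files_containing(Path("."), "TODO")) puts a single number into the chat. The file contents never enter the context at all.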

It also does session continuity. As you work, it records file edits, git operations, tasks, and errors into SQLite. When the conversation compacts, it can rebuild that working state so the model picks up where you left off instead of asking you what you were doing.

Installation on Claude Code goes through the plugin marketplace:

/plugin marketplace add mksglu/context-mode
/plugin install context-mode@context-mode

Restart, then run /context-mode:ctx-doctor to confirm everything wired up.

Learn how to turn it off before you need to.

Here's the catch, and it's the reason I tell people to try it rather than blindly trust it. Because the raw MCP input and output never land in the chat, debugging an MCP tool gets painful. If your agent is calling some tool wrong, you can't just scroll up and read the exchange, because the exchange isn't there. It got sandboxed.

So learn how to disable context-mode before you actually need to. When an MCP integration is misbehaving, turn it off, debug the tool with the raw payloads visible, then turn it back on. If you wait until you're mid-incident to figure out the off switch, you'll be annoyed.

One more grain of salt: the project reports reductions up to 98%. Treat numbers like that as directional. They're self-reported and they swing hard depending on your workload. Measure your own.

Do you even need this?

Worth asking honestly. The harnesses are catching up on their own. Claude Code already compacts context automatically and clears stale tool results, so some of what these tools do may get absorbed into the platform over time.

But "premature optimization" is my usual warning, and the reason this clears that bar is the effort side, not the savings side. You're not rewriting anything or trading away readability. You install a binary, flip a couple of settings, and the work continues exactly as before, just cheaper and a bit clearer. When the cost of trying something is this low and the savings show up on day one, I'll take it.

Be careful what you cache.

Prompt caching deserves its own warning here, because it's the one optimization that can quietly work against you. The pitch sounds like free money: mark a stable chunk of the prompt, pay a small premium to write it once, then read it back at a tenth of the price on every later turn. The catch is what it nudges you to do. The bigger and more frozen your context, the more you save. And the bigger and more frozen your context, the faster it rots.
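The arithmetic behind that pitch fits in a few lines of Python. The 0.10 read multiplier is the "tenth of the price" above. The 1.25 write multiplier is my assumption for the "small premium", so check your provider's pricing. The sketch also ignores cache expiry between turns.

```python
def cached_cost(prefix_tokens: int, turns: int,
                write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of a cached prefix over `turns` turns, in base-input-token units.

    The first turn pays the write premium; every later turn pays the read price.
    """
    if turns < 1:
        return 0.0
    return prefix_tokens * (write_mult + read_mult * (turns - 1))


def uncached_cost(prefix_tokens: int, turns: int) -> float:
    """Cost of resending the same prefix at full price on every turn."""
    return float(prefix_tokens * max(turns, 0))
```

Under those assumptions, ten turns over a 100k-token prefix cost roughly 215k token-units cached against 1M uncached. The saving grows with the size of the prefix, which is exactly the nudge toward "bigger and more frozen".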

That's the trap. One day the agent starts missing things it caught an hour ago, and you decide the model got dumber. It didn't. You buried the signal under a pile of cheap, cached tokens and stopped noticing.

It cuts both ways, too. Too little context and the model guesses, because it doesn't have what it needs. Too much and it starts seeing patterns that aren't there, dragging stale facts from twenty turns ago into the answer and hallucinating off its own noise. No flag fixes that. The whole game is keeping it balanced, and caching pushes you toward "more" precisely because more is cheaper. Cheap is not the same as good.

So my honest take: don't hand-roll prompt caching unless you own the inference stack. Claude already caches the stable prefix for you, Codex does its own caching on the OpenAI side, and the rest are heading the same way. Unless you're running your own swarm of local models where you control the entire request structure, you're reinventing something the platform already does for free, and you inherit the rot along with it.

The caching worth your time is the safe kind: caching what your tools and MCP servers produce, not the conversation itself. That's exactly what rtk and context-mode do. They keep the volatile junk out of the live window instead of freezing a stale prefix in place. Compressing a git log or sandboxing a fat MCP payload can't poison the model's reasoning the way a bloated, lovingly cached context can. Cache the exhaust, not the conversation.

Numbers from my setup.

These are my own numbers, not the projects' headline figures. Two tools, two very different ways of counting, and I trust one of them more than the other.

rtk is the one I trust. Across roughly three thousand intercepted commands it saved about 83.5% of the tokens that command output would otherwise have spent. The surprising part was where the savings came from. Sort the breakdown by impact and the entire top of the list is one command: go test.

That's the command I was nervous about earlier, so here's the honest reconciliation. Running the suite is the noisiest thing I do, and on a passing run the output is pure ceremony the model gains nothing from, which is exactly why it compresses to almost nothing. The savings come entirely from the passes. The failures, which carry all the signal, get written out in full by tee. The command I was most tempted to exclude turned into my single biggest win, precisely because compressing it throws away only noise.
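If you want to run the same kind of breakdown on your own logs, the calculation is simple. This Python sketch assumes you can get (command, raw tokens, compressed tokens) tuples from somewhere. The record shape and function name are mine, not an rtk export format.

```python
from collections import defaultdict


def savings_report(records):
    """Return (overall % saved, [(command, tokens saved)] sorted by impact).

    `records` is an iterable of (command, raw_tokens, compressed_tokens).
    """
    raw = defaultdict(int)
    kept = defaultdict(int)
    for command, raw_tokens, compressed_tokens in records:
        raw[command] += raw_tokens
        kept[command] += compressed_tokens

    total_raw = sum(raw.values())
    total_saved = total_raw - sum(kept.values())
    overall = 100.0 * total_saved / total_raw if total_raw else 0.0

    by_impact = sorted(
        ((cmd, raw[cmd] - kept[cmd]) for cmd in raw),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return overall, by_impact
```

Sorting by absolute tokens saved rather than by percentage is deliberate. A command that compresses by 99% but runs twice a day matters less than one that compresses by 80% and runs every few minutes.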

context-mode is the one to read with a grain of salt, because its own accounting is clearly off. Its lifetime totals are badly incomplete, undercounting my real usage by a wide margin. So I ignore the aggregate numbers it reports and look only at single sessions, and even then I trust what I see in my editor over what the dashboard claims.

The per-session results swing hard, and that swing is the real story. On a long Swift and iOS session, the kind where context had grown past 280k tokens, context-mode reported keeping around 96% of the raw tool output out of the window, which was the difference between staying inside a 256k model and spilling over. On two smaller sessions, a Go code review and building a Jira reporting skill, it kept 0% out. Everything fit, so there was nothing to offload.

That's not the tool failing on the small ones. It's telling the truth: the benefit scales with how much raw tool output a session produces, and most of mine don't produce much. The exception is loud and consistent. In my experience, anything Xcode-related drags in far more context and tokens than my Go work or my Jira and Notion MCP calls, by a wide margin. So the iOS session is exactly where the offloading earned its keep, and the quieter sessions are where it sat idle and did no harm.

The number I actually believe isn't on any dashboard. It's that I almost never see context truncation anymore, and tasks that used to need a 1M-token window now sit comfortably inside a 256k model without it losing the plot. Most of that came from getting the MCP tool overhead out of the live context. That's the win, whatever the counter says.

The cannon: headroom.

There's a bigger weapon on the shelf, and it's worth knowing about even if you don't pick it up. headroom takes the opposite bet from the two tools above. Instead of one job done well, it tries to be the whole compression layer: it sits between your agent and the provider and compresses everything the model reads, tool output, logs, RAG chunks, files, and the conversation history itself, before any of it lands in the window.

The detail that surprised me is that it doesn't compete with rtk. It bundles it. headroom ships the rtk binary for shell-output rewriting and compresses everything downstream of that. So it's a superset, not an alternative. You run it as a library, as a proxy, or with a single headroom wrap claude, and it slots in front of the model.

A few things set it apart from the point tools. Its compression is reversible: it keeps the originals locally and lets the model pull them back on demand, which is a real safety net that rtk's failure-only fallback doesn't match. It aligns prefixes so the provider's own cache keeps hitting instead of getting busted every turn. And it carries a shared memory across Claude, Codex, and Gemini, so several agents can draw on the same context.
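Reversibility is the part worth picturing. Here's a bare-bones Python sketch of the idea, with a hypothetical ReversibleStore class and a naive "keep the first few lines" compressor standing in for headroom's model-based one. None of these names come from headroom's API.

```python
import hashlib


class ReversibleStore:
    """Keep originals locally; hand the model a short version plus a reference."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def compress(self, text: str, head: int = 3) -> str:
        lines = text.splitlines()
        if len(lines) <= head:
            return text  # nothing worth eliding
        # Content hash as the reference, so identical text maps to one entry.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = text
        marker = f"[{len(lines) - head} lines elided, ref={key}]"
        return "\n".join(lines[:head] + [marker])

    def expand(self, key: str) -> str:
        """Return the untouched original for a reference seen in the output."""
        return self._originals[key]
```

The marker is what makes the cut safe. The model can see that something was removed and has a handle to ask for it, instead of reasoning over a gap it doesn't know exists.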

That last part is the tell for who it's actually for. This isn't a tool for one developer on one provider. It earns its keep when you're running multiple agents or multiple providers and want a single shared cache and memory underneath them, or when you own the inference stack and can justify the weight. And it's weight: a Python install, an ML compression model, a runtime to serve it, and an optional memory stack with real databases behind it. That's a platform, not a quick install. The aggressive, model-based compression is also the most likely to quietly drop something important, though the reversibility takes the edge off that risk.

Wrap up.

Context engineering for agentic coding comes down to a simple idea: stop paying to shuttle exhaust into the model. rtk trims the shell output, context-mode keeps raw MCP payloads out of the window, and between them you claw back both money and a sharper model.

Just be deliberate about what you let the filter throw away. Point the compression at the noisy, low-stakes stuff, keep your test and lint output honest, and know where the off switch is before you need it. Measure on your own workload, because the answer might surprise you.

As for the cannon, I'm leaving it on the shelf for now. On a single provider, rtk and context-mode already clear a decent chunk of the waste on their own, and that's where my attention goes: quick wins, low effort, nothing I have to babysit. I'll reach for headroom when the shape of the problem changes. Once I've squeezed local models far enough that building my own little swarm at home makes sense, or once I'm running enough providers that a shared cache and a shared context become the actual bottleneck, the weight will be worth it. Not today.

See you on the next one 👾

19 Jun 2026