[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$pC4ZxlFnru":3,"blog-post-context-engineering-for-claude-code-improving-token-efficiency":4},"email-5yiq2jmc",{"content":5,"created_at":6,"description":7,"id":8,"keywords":9,"reading_time":14,"slug":15,"title":16,"updated_at":17,"ok":18},"# Context engineering for Claude: improving token efficiency.\n\nToken bills are the new cloud bills. If you run an agentic coding tool all day, you've probably watched the number climb and wondered where it all went. A lot of it goes to noise: the agent runs `go test`, dumps four thousand tokens of passing-test noise into the context, re-reads the same 600-line file for the third time, then pastes another wall of `git status` output. None of that is the work. It's the exhaust.\n\nMost people reach for shorter prompts first. Wrong lever. The tokens aren't in your wording, they're in the exhaust, which makes token efficiency **a context-engineering problem more than a writing one**.\n\nI've been running two tools to cut that exhaust for a while now: [rtk](https:\u002F\u002Fgithub.com\u002Frtk-ai\u002Frtk) and [context-mode](https:\u002F\u002Fgithub.com\u002Fmksglu\u002Fcontext-mode). And you know how I love small changes with low effort that give huge improvements in optimization. Same idea as [compressing a Redis cache down by ~90% with zstd](https:\u002F\u002Fgozman.space\u002Fblog\u002Fshrinking-redis-cache-with-msgp-and-zstd-in-golang), except here the bytes you're saving are tokens. You install a thing, you adjust a setting or two, and the bill drops. This is one of those.\n\nA note on scope. Everything here is about the CLI version of Claude Code. All of these tools work with Gemini CLI and Codex too, but Claude Code is where I tested, so that's what I'll talk about.\n\n## It's not just the bill.\n\nThe obvious reason to trim tokens is cost. The less obvious reason is quality.\n\nLong, junk-filled context doesn't just sit there harmlessly until you hit the window limit. It actively drags the model down. Accuracy slides as the input grows, well before the window is full, and models are especially bad at using information buried in the middle of a long context. You feel it in practice: an agent that's read fifty files and three log dumps gets vague in a way it wasn't twenty minutes earlier.\n\nSo trimming the exhaust does two things at once. It saves money, and it keeps the model sharper for longer. That's a rare combination, and it's why I bother.\n\n## rtk: compress command output before it hits the model.\n\n`rtk` is a CLI proxy. It sits between the agent's shell commands and the model, filters out the noise, and hands back a compact version. It's a single Rust binary, so installation is boring in the good way:\n\n```bash\nbrew install rtk\nrtk init -g       # installs the hook and an RTK.md, for Claude Code\n```\n\nThen restart Claude Code. That `init` step is the \"adjust some Claude settings\" part. It drops a `PreToolUse` hook that quietly rewrites your agent's Bash commands. When Claude runs `git status`, the hook turns it into `rtk git status` before it executes, and the agent gets the trimmed output without knowing anything changed.\n\nUnder the hood it leans on four ideas: strip the boilerplate, group similar items together, truncate the parts that don't carry signal, and collapse repeated lines into counts. The effect on something chatty like a push is dramatic:\n\n```bash\n# git push, raw\nEnumerating objects: 5, done.\nCounting objects: 100% (5\u002F5), done.\nDelta compression using up to 8 threads\n... about a dozen more lines ...\n\n# rtk git push\nok main\n```\n\nThe agent didn't need any of the enumeration chatter. It needed to know the push worked and which branch it landed on.\n\nIn my own usage, `rtk` cut **well over 80%** off the tokens spent on CLI command output. I haven't hit a problem caused by it so far, but that number comes with an asterisk, and the asterisk is the whole next section.\n\n### One thing to know about scope.\n\nThe hook only fires on Bash tool calls. Claude Code's built-in `Read`, `Grep`, and `Glob` tools don't pass through it, so they aren't compressed automatically. If you want `rtk`'s trimming there too, use shell commands (`cat`, `rg`, `find`) or call `rtk read` and `rtk grep` directly. Worth knowing so you're not confused when a big file read shows up at full size.\n\n## Where I hold rtk back.\n\nThere's a [good study floating around](https:\u002F\u002Freddit.com\u002Fr\u002FClaudeCode\u002Fcomments\u002F1spiy8t\u002Ftoken_optimizers_for_ai_coding_agents_are\u002F) where someone benchmarked these optimizers properly, and the conclusion stuck with me: **they can be silently dangerous**. The tool decides what counts as \"junk,\" and one bad cut makes your agent quietly dumber without you ever seeing why. Nothing errors out. The answers just get a little worse, and you have no idea the filter ate the reason.\n\nThat risk isn't the same for every command, and this is where you get to be smart about it.\n\nFor something like `git status`, `git log`, or a generic log dump, the noise-to-signal ratio is terrible and the cost of a mistake is low. If the filter trims a `git log` a bit too aggressively, the agent asks again or moves on. Fine. So I let `rtk` handle those freely.\n\nTest runners and linters feel like the opposite, and this is where it gets interesting. When a test fails, the exact assertion and the line number are the whole point. When a linter flags a rule on a specific file, that detail is what the agent is supposed to act on. Those are exactly the things an aggressive filter is tempted to drop, because to a compressor they read like repetitive structured noise. So the obvious move is to exclude tests outright. I didn't, and I'm glad I didn't.\n\nThe thing that saves you is the pass\u002Ffail asymmetry. A passing test run is pure ceremony, hundreds of lines of \"ok\" the model learns nothing from. A failing run is all signal. So instead of excluding tests, I lean on `rtk`'s `tee`: passing runs collapse to a one-liner, and the moment something fails, rtk writes the full unfiltered output to disk so the failure detail survives intact and the agent reads it without re-running anything.\n\nYou'll find the config file location and its full structure in the [rtk docs](https:\u002F\u002Fwww.rtk-ai.app\u002Fdocs\u002Fgetting-started\u002Fconfiguration\u002F#config-file-location). The relevant snippet for the `tee` setting looks like this:\n\n```toml\n[tee]\nenabled = true\nmode = \"failures\"   # keep the full output on disk whenever something breaks\n\n[hooks]\n# the escape hatch: only list a command here if you catch rtk\n# dropping something you actually needed\nexclude_commands = []\n```\n\nThat `tee` line is the real guard, not the exclude list. It's what lets me compress the noisiest commands I run without losing the one output that ever matters. If I catch `rtk` eating a detail I needed, the command goes in `exclude_commands`, but so far I haven't had to put anything there.\n\n## context-mode: keep raw tool output out of context entirely.\n\n`rtk` handles shell output. `context-mode` goes after the other big offender: MCP tools.\n\nEvery MCP call tends to dump its raw payload straight into the conversation. A browser snapshot, twenty fetched issues, a fat access log. It piles up fast and you lose a chunk of your window to data you looked at once. `context-mode` is an MCP server that keeps that raw data out of the context window. It runs the work in a sandbox, stores the request and response bits in a local SQLite database, and surfaces only the slice that's actually relevant.\n\nThe mental model it pushes is \"think in code.\" Instead of reading fifty files into context to count something, the agent writes a small script that does the counting and prints just the answer. One script replaces ten tool calls, and only the result lands in the chat.\n\nIt also does session continuity. As you work, it records file edits, git operations, tasks, and errors into SQLite. When the conversation compacts, it can rebuild that working state so the model picks up where you left off instead of asking you what you were doing.\n\nInstall on Claude Code is through the plugin marketplace:\n\n```bash\n\u002Fplugin marketplace add mksglu\u002Fcontext-mode\n\u002Fplugin install context-mode@context-mode\n```\n\nRestart, then run `\u002Fcontext-mode:ctx-doctor` to confirm everything wired up.\n\n### Learn how to turn it off before you need to.\n\nHere's the catch, and it's the reason I tell people to try it rather than blindly trust it. Because the raw MCP input and output never land in the chat, debugging an MCP tool gets painful. If your agent is calling some tool wrong, you can't just scroll up and read the exchange, because the exchange isn't there. It got sandboxed.\n\nSo learn how to disable `context-mode` before you actually need to. When an MCP integration is misbehaving, turn it off, debug the tool with the raw payloads visible, then turn it back on. If you wait until you're mid-incident to figure out the off switch, you'll be annoyed.\n\nOne more grain of salt: the project reports reductions up to 98%. Treat numbers like that as directional. They're self-reported and they swing hard depending on your workload. Measure your own.\n\n## Do you even need this?\n\nWorth asking honestly. The harnesses are catching up on their own. Claude Code already compacts context automatically and clears stale tool results, so some of what these tools do may get absorbed into the platform over time.\n\nBut \"premature optimization\" is my usual warning, and the reason this clears that bar is the effort side, not the savings side. You're not rewriting anything or trading away readability. You install a binary, flip a couple of settings, and the work continues exactly as before, just cheaper and a bit clearer. When the cost of trying something is this low and the savings show up on day one, I'll take it.\n\n## Be careful what you cache.\n\nPrompt caching deserves its own warning here, because it's the one optimization that can quietly work against you. The pitch sounds like free money: mark a stable chunk of the prompt, pay a small premium to write it once, then read it back at a tenth of the price on every later turn. The catch is what it nudges you to do. The bigger and more frozen your context, the more you save. And the bigger and more frozen your context, the faster it rots.\n\nThat's the trap. One day the agent starts missing things it caught an hour ago, and you decide the model got dumber. It didn't. You buried the signal under a pile of cheap, cached tokens and stopped noticing.\n\nIt cuts both ways, too. Too little context and the model guesses, because it doesn't have what it needs. Too much and it starts seeing patterns that aren't there, dragging stale facts from twenty turns ago into the answer and hallucinating off its own noise. No flag fixes that. The whole game is keeping it balanced, and caching pushes you toward \"more\" precisely because more is cheaper. Cheap is not the same as good.\n\nSo my honest take: don't hand-roll prompt caching unless you own the inference stack. Claude already caches the stable prefix for you, Codex does its own caching on the OpenAI side, and the rest are heading the same way. Unless you're running your own swarm of local models where you control the entire request structure, you're reinventing something the platform already does for free, and you inherit the rot along with it.\n\nThe caching worth your time is the safe kind: caching what your tools and MCP servers produce, not the conversation itself. That's exactly what `rtk` and `context-mode` do. They keep the volatile junk out of the live window instead of freezing a stale prefix in place. Compressing a `git log` or sandboxing a fat MCP payload can't poison the model's reasoning the way a bloated, lovingly cached context can. **Cache the exhaust, not the conversation.**\n\n## Numbers from my setup.\n\nThese are my own numbers, not the projects' headline figures. Two tools, two very different ways of counting, and I trust one of them more than the other.\n\n`rtk` is the one I trust. Across roughly three thousand intercepted commands it **saved about 83.5% of the tokens** that command output would otherwise have spent. The surprising part was where the savings came from. Sort the breakdown by impact and the entire top of the list is one command: `go test`.\n\nThat's the command I was nervous about earlier, so here's the honest reconciliation. Running the suite is the noisiest thing I do, and on a passing run the output is pure ceremony the model gains nothing from, which is exactly why it compresses to almost nothing. The savings come entirely from the passes. The failures, which carry all the signal, get written out in full by `tee`. The command I was most tempted to exclude turned into my single biggest win, precisely because compressing it throws away only noise.\n\n`context-mode` is the one to read with a grain of salt, because its own accounting is clearly off. Its lifetime totals are badly incomplete, undercounting my real usage by a wide margin. So I ignore the aggregate numbers it reports and look only at single sessions, and even then I trust what I see in my editor over what the dashboard claims.\n\nThe per-session results swing hard, and that swing is the real story. On a long Swift and iOS session, the kind where context had grown past 280k, **context-mode reported keeping around 96%** of the raw tool output out of the window, the difference between staying inside a 256k model and spilling over. On two smaller sessions, a Go code review and building a Jira reporting skill, it kept 0% out. Everything fit, so there was nothing to offload.\n\nThat's not the tool failing on the small ones. It's telling the truth: the benefit scales with how much raw tool output a session produces, and most of mine don't produce much. The exception is loud and consistent. In my experience, anything Xcode-related drags in far more context and tokens than my Go work or my Jira and Notion MCP calls, by a wide margin. So the iOS session is exactly where the offloading earned its keep, and the quieter sessions are where it sat idle and did no harm.\n\nThe number I actually believe isn't on any dashboard. It's that I almost never see context truncation anymore, and tasks that used to need a 1M-token window now sit comfortably inside a 256k model without it losing the plot. Most of that came from getting the MCP tool overhead out of the live context. That's the win, whatever the counter says.\n\n## The cannon: headroom.\n\nThere's a bigger weapon on the shelf, and it's worth knowing about even if you don't pick it up. [headroom](https:\u002F\u002Fgithub.com\u002Fchopratejas\u002Fheadroom) takes the opposite bet from the two tools above. Instead of one job done well, it tries to be the whole compression layer: it sits between your agent and the provider and compresses everything the model reads, tool output, logs, RAG chunks, files, and the conversation history itself, before any of it lands in the window.\n\nThe detail that surprised me is that it doesn't compete with `rtk`. **It bundles it.** headroom ships the `rtk` binary for shell-output rewriting and compresses everything downstream of that. So it's a superset, not an alternative. You run it as a library, as a proxy, or with a single `headroom wrap claude`, and it slots in front of the model.\n\nA few things set it apart from the point tools. Its compression is reversible: it keeps the originals locally and lets the model pull them back on demand, which is a real safety net that `rtk`'s failure-only fallback doesn't match. It aligns prefixes so the provider's own cache keeps hitting instead of getting busted every turn. And it carries a shared memory across Claude, Codex, and Gemini, so several agents can draw on the same context.\n\nThat last part is the tell for who it's actually for. This isn't a tool for one developer on one provider. It earns its keep when you're running multiple agents or multiple providers and want a single shared cache and memory underneath them, or when you own the inference stack and can justify the weight. And it's weight: a Python install, an ML compression model, a runtime to serve it, and an optional memory stack with real databases behind it. That's a platform, not a quick install. The aggressive, model-based compression is also the most likely to quietly drop something important, though the reversibility takes the edge off that risk.\n\n## Wrap up.\n\nContext engineering for agentic coding comes down to a simple idea: stop paying to shuttle exhaust into the model. `rtk` trims the shell output, `context-mode` keeps raw MCP payloads out of the window, and between them you claw back both money and a sharper model.\n\nJust be deliberate about what you let the filter throw away. Point the compression at the noisy, low-stakes stuff, keep your test and lint output honest, and know where the off switch is before you need it. Measure on your own workload, because the answer might surprise you.\n\nAs for the cannon, I'm leaving it on the shelf for now. On a single provider, `rtk` and `context-mode` already clear a decent chunk of the waste on their own, and that's where my attention goes: quick wins, low effort, nothing I have to babysit. I'll reach for headroom when the shape of the problem changes. Once I've squeezed local models far enough that building my own little swarm at home makes sense, or once I'm running enough providers that a shared cache and a shared context become the actual bottleneck, the weight will be worth it. Not today.\n\nSee you on the next one 👾\n\nSources:\n\n- rtk: [https:\u002F\u002Fgithub.com\u002Frtk-ai\u002Frtk](https:\u002F\u002Fgithub.com\u002Frtk-ai\u002Frtk)\n- rtk docs: [https:\u002F\u002Fwww.rtk-ai.app\u002Fdocs\u002Fgetting-started\u002Fconfiguration\u002F#config-file-location](https:\u002F\u002Fwww.rtk-ai.app\u002Fdocs\u002Fgetting-started\u002Fconfiguration\u002F#config-file-location)\n- context-mode: [https:\u002F\u002Fgithub.com\u002Fmksglu\u002Fcontext-mode](https:\u002F\u002Fgithub.com\u002Fmksglu\u002Fcontext-mode)\n- headroom: [https:\u002F\u002Fgithub.com\u002Fchopratejas\u002Fheadroom](https:\u002F\u002Fgithub.com\u002Fchopratejas\u002Fheadroom)\n- Token optimizer study: [https:\u002F\u002Freddit.com\u002Fr\u002FClaudeCode\u002Fcomments\u002F1spiy8t\u002Ftoken_optimizers_for_ai_coding_agents_are\u002F](https:\u002F\u002Freddit.com\u002Fr\u002FClaudeCode\u002Fcomments\u002F1spiy8t\u002Ftoken_optimizers_for_ai_coding_agents_are\u002F)\n","2026-06-19T07:52:54.679023Z","Token efficiency in agentic coding is really a context-engineering problem: the cost driver isn't long prompts, it's bloated command output, idle tool schemas, and stale history. I run rtk and context-mode to cut that waste in Claude Code, and they make the model sharper while they're at it. How they work, where I let them cut and where I don't, why caching can backfire, and the all-in-one cannon I'm not reaching for yet.",9,[10,11,12,13],"ai","claude","llm","optimization",707,"context-engineering-for-claude-code-improving-token-efficiency","Context engineering for Claude Code: improving token efficiency","2026-06-19T08:18:26.987613Z",true]