Hacker News

Universal Claude.md – cut Claude output tokens

438 points by killme2008 today at 1:23 AM | 155 comments

Comments

btown today at 1:49 AM

It seems the benchmarks here are heavily biased towards single-shot explanatory tasks, not agentic loops where code is generated: https://github.com/drona23/claude-token-efficient/blob/main/...

And I think this raises a really important question. When you're deep into a project that's iterating on a live codebase, does Claude's default verbosity, where it's allowed to expound on why it's doing what it's doing when it's writing massive files, allow the session to remain more coherent and focused as context size grows? And in doing so, does it save overall tokens by making better, more grounded decisions?

The original link here has one rule that says: "No redundant context. Do not repeat information already established in the session." To me, I want more of that. That's goal-oriented quasi-reasoning tokens that I do want it to emit, visualize, and use, that very possibly keep it from getting "lost in the sauce."

By all means, use this in environments where output tokens are expensive, and you're processing lots of data in parallel. But I'm not sure there's good data on this approach being effective for agentic coding.

xianshou today at 2:13 AM

From the file: "Answer is always line 1. Reasoning comes after, never before."

LLMs are autoregressive (filling in the completion of what came before), so you'd better have thinking mode on or the "reasoning" is pure confirmation bias seeded by the answer that gets locked in via the first output tokens.
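To make the mechanics concrete, here's a toy sketch (not a real LLM, just the shape of the sampling loop): each emitted token is appended to the context, so an answer placed on line 1 conditions every "reasoning" token that follows.

```python
# Toy autoregressive loop. `model` stands in for P(next token | context);
# here it's a stub that just echoes the last token, to show the mechanics.
def generate(model, prompt_tokens, n_tokens):
    context = list(prompt_tokens)
    for _ in range(n_tokens):
        next_token = model(context)  # conditioned on everything so far,
        context.append(next_token)   # including the already-committed answer
    return context[len(prompt_tokens):]

completion = generate(lambda ctx: ctx[-1], ["Answer:", "42"], 3)
```

Once "42" is in the context, nothing generated afterwards can un-commit it; that's the confirmation-bias point.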

niklassheth today at 4:56 AM

So many problems with this:

The benchmark is totally useless. It measures single prompts, and only compares output tokens with no regard for accuracy. I could obliterate this benchmark with the prompt "Always answer with one word".

This line: "If a user corrects a factual claim: accept it as ground truth for the entire session. Never re-assert the original claim." You're totally destroying any chance of getting pushback; any mistake you make in the prompt would be catastrophic.

"Never invent file paths, function names, or API signatures." Might as well add "do not hallucinate".

joshstrange today at 2:10 AM

As with all of these cure-alls, I'm wary. Mostly I'm wary because I anticipate the developer will lose interest in very little time and also because it will just get subsumed into CC at some point if it actually works. It might take longer but changing my workflow every few days for the new thing that's going to reduce MCP usage, replace it, compress it, etc is way too disruptive.

I'm generally happy with the base Claude Code and I think running a near-vanilla setup is the best option currently with how quickly things are moving.

sillysaurusx today at 1:49 AM

> the file loads into context on every message, so on low-output exchanges it is a net token increase

Isn’t this what Claude’s personalization setting is for? It’s globally on.

I like conciseness, but it should be because it makes the writing better, not because it saves you some tokens. I’d sacrifice extra tokens for outputs that were 20% better, and there’s a correlation between conciseness and quality.

See also this Reddit comment for other things that supposedly help: https://www.reddit.com/r/vibecoding/s/UiOywQMOue

> Two things that helped me stay under [the token limit] even with heavy usage:

> Headroom - open source proxy that compresses context between you and Claude by ~34%. Sits at localhost, zero config once running. https://github.com/chopratejas/headroom

> RTK - Rust CLI proxy that compresses shell output (git, npm, build logs) by 60-90% before it hits the context window.

> Stacks on top of Headroom. https://github.com/rtk-ai/rtk

> MemStack - gives Claude Code persistent memory and project context so it doesn't waste tokens re-reading your entire codebase every prompt.

> That's the biggest token drain most people don't realize. https://github.com/cwinvestments/memstack

> All three stack together. Headroom compresses the API traffic, RTK compresses CLI output, MemStack prevents unnecessary file reads.

I haven’t tested those yet, but they seem related and interesting.
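For a rough idea of what "compressing shell output before it hits the context window" can mean, here's a minimal sketch of head/tail truncation; this is not RTK's actual algorithm (I haven't read its source), just the general technique:

```python
def compress_log(text, head=5, tail=5):
    """Keep only the first and last lines of a long log; failures in build
    and test output usually sit at the edges. A sketch of the idea only,
    not how RTK actually works."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    return "\n".join(lines[:head]
                     + [f"... [{omitted} lines omitted] ..."]
                     + lines[-tail:])

log = "\n".join(f"step {i}: ok" for i in range(100))
short = compress_log(log)  # 100 lines down to 11
```

Whether the quoted 60-90% holds obviously depends on the log.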

motoboi today at 2:37 AM

Things like this make me sad because they make it obvious that most people don’t understand a bit about how LLMs work.

The “answer before reasoning” rule is good evidence of it. It misses the most fundamental property of transformers: they are autoregressive.

Also, reinforcement learning is what makes the model behave the way you are trying to avoid. The model’s default output is what performs best on the kind of software engineering task you are trying to achieve. I’m not sure, but I’m pretty confident that response length is a target the model labs optimize for. So the model is trained to achieve high scores on the benchmarks (and the training dataset) while balancing length, sycophancy, safety, and capability.

So, actually, pushing Claude too far from its default behavior will probably hurt capability. Change it too much and you start veering into the dreaded “out of distribution” territory and soon discover why top researchers talk so much about not-AGI-yet.

aeneas_ory today at 10:04 AM

Why is this ridiculous thing trending on HN? There are good tools to reduce token use that actually work, like https://github.com/thedotmack/claude-mem and https://github.com/ory/lumen!

danpasca today at 2:11 AM

I might be wrong, but based on the videos I've watched from Karpathy, this would generally make the model worse. I'm thinking of the math examples (why can't ChatGPT do math?) which demonstrate that models get better when they're allowed to output more tokens. So be aware, I guess.

ape4 today at 12:34 PM

Remember when we worked on new hashing, cryptography, compression, etc algorithms? Now we are trying to find the best ways to tell an AI to be quiet.

lilOnion today at 8:43 AM

While LLMs are extremely cool, I can't see how this gets on the front page. Anyone who has interacted with LLMs for at least an hour could've figured out to say something like "be less verbose", and it would be. There are so many cool projects and ideas, and a .md file gets the spotlight.

monooso today at 2:08 AM

Paul Kinlan published a blog post a couple of days ago [1] with some interesting data showing that output tokens only account for 4% of token usage.

It's a pretty wide-reaching article, so here's the relevant quote (emphasis mine):

> Real-world data from OpenRouter’s programming category shows 93.4% input tokens, 2.5% reasoning tokens, and just 4.0% output tokens. It’s almost entirely input.

[1]: https://aifoc.us/the-token-salary/
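Worth noting that 4% of tokens isn't 4% of cost, since output tokens are priced higher than input. Back-of-envelope with the quoted split, assuming output and reasoning tokens cost ~5x input tokens (roughly Anthropic's current ratio; treat the exact multiplier as an assumption):

```python
# Percentage-of-tokens split quoted above, and an assumed relative price
# per token (reasoning is billed as output on most APIs).
split = {"input": 93.4, "reasoning": 2.5, "output": 4.0}
price = {"input": 1.0, "reasoning": 5.0, "output": 5.0}

cost = {k: split[k] * price[k] for k in split}
total = sum(cost.values())
output_cost_share = cost["output"] / total        # ~16% of spend, not 4%
halved_output_saving = (cost["output"] / 2) / total  # ~8% off the bill, at best
```

So cutting output in half saves on the order of 8% of the bill under these assumptions; still mostly input.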

Asmod4n today at 5:19 AM

Someone measured how this affects token efficiency. Spoiler: efficiency is highest without any instructions.

https://github.com/drona23/claude-token-efficient/issues/1

skeledrew today at 2:54 AM

Strange. I've never experienced verbosity with Claude. It always gets right to the point, and everything it outputs tends to be useful. Can actually be short at times.

ChatGPT on the other hand is annoyingly wordy and repetitive, and is always holding out on something, tempting you to send an "OK", "Show me", or something of the sort to get more. But I can't be bothered trying to optimize away the cruft, as it may affect the thing it's seriously good at and that I really use it for: research and brainstorming, usually to get a spec that I then pass to Claude to fill out the gaps (there are always multiple) and implement. It's absolutely designed to maximize engagement far more than issue resolution.

ryanschaefer today at 5:08 AM

The whole “Code Output” section is horrifying, especially given how I have seen Claude operate in a large monorepo.

This mode of operation results in hacks on top of shaky hacks on top of even flimsier, throw away, absolutely sloppy hacks.

An example: using dict-like structs instead of classes. Claude really likes to aggressively load all of the data it can, even if it’s not needed. This further exhibits itself as never wanting to add something directly to a class, instead wanting to add around it.

jdthedisciple today at 6:41 PM

people are overthinking this stuff.

use up your monthly quota at your pace, call it quits 'til the 1st, relax with a drink, and read a book

andai today at 2:01 AM

I told mine to remove all unnecessary words from a sentence and talk like caveman, which should result in another 50% savings ;)

adastra22 today at 3:25 AM

> Answer is always line 1. Reasoning comes after, never before.

The very first rule doesn’t work. If you ask for the answer up front, it will make something up and then justify it. If you ask for reasoning first, it will brainstorm and then come up with a reasonable answer that integrates its thinking.

galaxyLogic today at 3:25 AM

So there's a direct monetary cost to this extra verbiage:

"Great question! I can see you're working with a loop. Let me take a look at that. That's a thoughtful piece of code! However,"

And they are charging for every word! But there's also another cost: the cognitive load. I have to read through the above before I actually get to the information I was asking for. Sure, many people appreciate the sycophancy; it makes us all feel good. But for me sycophantic responses reduce the credibility of the answers. It feels like Claude just wants me to feel good, whether I or it is right or wrong.

ihtef today at 11:41 AM

"Simplest working solution. No over-engineering." "Simplicity is the ultimate sophistication." (Leonardo da Vinci.) My thought: you cannot reach the simplest solution without doing some over-engineering first.

sibtain1997 today at 3:55 PM

claude.md rules that cut "great question! here's what i'll do..." are fine. Rules that cut the actual thinking steps break the output. Don't confuse the two.

miguel_martin today at 2:23 AM

Is there a "universal AGENTS.md" for minimal code & documentation outputs? I find all coding agents to be verbose, even with explicit instructions to reduce verbosity.

rcleveng today at 1:49 AM

While I love this set of prompts, I’ve not seen my Claude Opus 4.6 give such verbose responses when using Claude Code. Is this intended for use outside of Claude Code?

cheriot today at 2:07 AM

I get where the authors are coming from with these: https://github.com/drona23/claude-token-efficient/blob/main/...

But I'd rather use the "instruction budget" on the task at hand. Some, like the Code Output section, can fit a code review skill.

sgt today at 8:13 AM

In Claude Code, /usage just hangs. I can't even see what my limits are, which is weird. Maybe a bug? I can't imagine I'm close to my limits though; I'm on the Max 20x plan, using Opus 4.6.

_the_inflator today at 8:44 AM

I see no point in this project. There aren’t any examples of the usage the author states the project is made for.

389 tokens saved? OK. Since I pay per million tokens, what is the ratio here? Is there any downside associated with output deletion?

Is Claude really using this behavior to make users bleed? I don’t think so.

PS: the author seems like a beginner. Agent feedback has always been helpful so far, and it’s also part of inter-agent communication. The author seems to lack experience.

As a lead I would not allow this to be included until proven otherwise: A/B testing.

notyourav today at 1:59 AM

It boggles my mind that an LLM "understands" and acts according to these given instructions. I'm using this every day, and 1-shot working code is now a normal expectation, but man, it's still very, very hard to believe what LLMs have achieved.

__m today at 5:47 AM

Doesn’t this huge claude.md file increase the input tokens?

bilbo-b-baggins today at 4:00 AM

Man, there are a LOT of people who have no idea how these GPT LLM services actually work, despite there being a large amount of documentation on the APIs, whitepapers, and so forth.

rcarmo today at 11:16 AM

Codex needs none of this :)

obilgic today at 2:46 AM

If you are interested in making Claude self learn.

https://github.com/oguzbilgic/agent-kernel

gregman1 today at 3:41 AM

> Answer is always line 1. Reasoning comes after, never before.

lol, closed

vlaaad today at 7:36 AM

My AGENTS.md is usually `be concise` — it saves on the input tokens as well, and leads by example.

popcorn_pirate today at 9:23 AM

This NLP was posted yesterday; the post was deleted though... https://colwill.github.io/axon

nvch today at 2:38 AM

The author offers to permanently put 400 words into the context to save 55-90 in T1-T3 benchmarks. Considering the 1:5 (input:output) token cost ratio, this could increase total spending.
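Plugging nvch's numbers in, assuming roughly one token per word for the file (an underestimate) and the cited 1:5 input:output price ratio, both assumptions:

```python
# Rough per-turn break-even for the numbers in the comment above.
file_input_tokens = 400      # the file is re-sent as input on every message
output_tokens_saved = 90     # best case from the T1-T3 benchmarks
input_price, output_price = 1.0, 5.0  # relative $/token, 1:5 assumed

extra_cost = file_input_tokens * input_price   # 400 cost units per turn
saving = output_tokens_saved * output_price    # 450 units per turn
net = saving - extra_cost                      # barely positive at 90 saved...
worst = 55 * output_price - extra_cost         # ...and negative at 55 saved
```

So even under a generous 1-token-per-word estimate, the file is only marginally net-positive at the top of the claimed range and a clear loss at the bottom.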

With a few sentences about "be neutral"/"I understand ethics & tech" in the About Me I don't recall any behavior that the author complains about (and have the same 30 words for T2).

(If I were Claude, I would despise a human who wrote this prompt.)

Tostino today at 1:51 AM

You have a benchmark for output token reduction, but no before/after comparison on a standard LLM benchmark to see if the instructions hurt intelligence.

Telling the model to only do post-hoc reasoning is an interesting choice, and may not play well with all models.

yieldcrv today at 1:44 AM

> Note: most Claude costs come from input tokens, not output. This file targets output behavior

so, everyone: that means your agents, skills, and MCP servers will still take up everything

Razengan today at 7:43 AM

Does Claude not respect AGENTS.md?

I love how seamless and intuitive Codex is in comparison:

~/AGENTS.md < project/AGENTS.md < project/subfolder/AGENTS.override.md

Meanwhile Claude doesn't even see that I asked for indentation by tabs and not spaces, or that the entire project uses tabs, and still generates code with spaces.. >_<

mattmanser today at 6:51 AM

This was ripped apart on Reddit, surprised to see it here.

verdverm today at 5:43 AM

I originally took my prompts from Claude Code (https://github.com/Piebald-AI/claude-code-system-prompts) and subsequently edited them to remove guardrails and output formatting like this post. I too included the last bit about user prompts overriding the system prompt, but like any good LLM, it doesn't always follow instructions.

gostsamo today at 3:56 AM

> No redundant context. Do not repeat information already established in the session.

Sounds like it came directly out of Umberto Eco's simple rules for writing.

themafia today at 3:42 AM

"Gee, we can't figure out _why_ people anthropomorphize our products! It must be that they're dumb!"

Meanwhile, their products:

bofadeez today at 3:13 AM

Lol this is so naive and optimistic. Claude will just do whatever it wants and apologize later. This is good for action #1 though.

nurettin today at 2:58 AM

For me, the thing that wastes the most tokens is Claude trying to execute inline code (Python, SQL) with escaping errors, trying over and over until it works. I set up skills and scripts for the most common bits, but there is always something new, and each self-healing loop takes another 20-30k tokens before you know it.
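One way to short-circuit those escaping retry loops is a helper the agent can call that writes the snippet to a file and executes the file, so no inline shell quoting is involved. A sketch of the "scripts for the common bits" idea, with a hypothetical helper name:

```python
import os, subprocess, sys, tempfile

def run_snippet(code: str) -> subprocess.CompletedProcess:
    """Write the snippet to a temp file and run it with the current Python,
    so quotes and newlines never need shell escaping. Illustrative only;
    swap in psql/sqlite3 etc. as the interpreter for SQL."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        return subprocess.run([sys.executable, path],
                              capture_output=True, text=True)
    finally:
        os.unlink(path)

result = run_snippet('print("no \'quoting\' hell")')
```

The nested quotes that would trip up an inline `python -c '...'` invocation pass through the file untouched.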

empressplay today at 2:30 AM

That output is there for a reason. It's not like any LLM is profitable now on a per-token basis; the AI companies would certainly love to output fewer tokens, since those cost _them_ money!

The entire hypothesis for doing this is somewhat dubious.

johnwheeler today at 2:02 AM

That's what I call a feature wishlist.

foxes today at 2:14 AM

>the honest trade off

Is this a subtle joke, or did they ask Claude to make a README that makes Claude better, say "be critical", and just dump it on GitHub?

brikym today at 2:45 AM

Can Anthropic kindly fuck off with their ADVERT.md already. It's AGENTS.md

Sent from my iPhone

uriahlight today at 3:27 AM

> No unsolicited suggestions. Do exactly what was asked, nothing more.

> No safety disclaimers unless there is a genuine life-safety or legal risk.

> No "Note that...", "Keep in mind that...", "It's worth mentioning..." soft warnings.

> Do not create new files unless strictly necessary.

Nah bruh. Those are some terrible rules. You don't want to be doing that.

TacticalCoder today at 2:58 AM

> Uses em dashes (--), smart quotes, Unicode characters that break parsers

Re: the Unicode chars that are a major PITA when used where they shouldn't be, there's a problem with the Claude Code CLI: there's a mismatch between what the model (say, Sonnet) thinks it's outputting (which it actually is outputting) and what the user sees at the terminal.

I'm pretty sure it's due to the Rube Goldberg machinery they decided to use, where they first render the response in a headless browser, then convert it back to text mode in real time.

I don't know if there's a setting to disable that insane behavior: it's nonsensical that what the user sees is not what the model output, while at the same time the model "thinks" the user is getting the proper output.

If you ask it to append all its messages (to the user) to a file, you can see, say, perfectly fine ASCII tables neatly indented in all their ASCII glory, and then... a fucked-up Unicode monstrosity in the Claude Code CLI terminal. Due to whatever mad conversion happened automatically: but worse, the model has zero idea these automated conversions are happening.
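Once you have the messages appended to a file, a quick way to check which side introduced the offending characters is to scan the transcript for anything outside printable ASCII (hypothetical helper, nothing Claude-specific):

```python
def find_non_ascii(text):
    """List (index, char, codepoint) for anything outside printable ASCII,
    so you can tell whether the transcript or the terminal introduced it."""
    return [(i, ch, hex(ord(ch)))
            for i, ch in enumerate(text) if ord(ch) > 126]

# Smart quotes and an em dash, written as escapes for clarity:
hits = find_non_ascii('plain "text" \u2014 \u201csmart\u201d')
```

If the transcript file comes back clean but the terminal shows smart quotes, the mangling happened in the rendering layer, not in the model's output.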

I don't know if there are options for that but it sure as heck ain't intuitive to find.

And it's really problematic when you need to dig into an issue and actually discuss with "the thing".

Anyway, time for a rant... I'm paying my subscription, but overall working with these tools feels like driving at 200 mph on the highway, bumping into the guardrails left and right every second, to then eventually crash the car into the building where you're supposed to go.

It "works", for some definition of "working".

The number of errors these things confidently make is through the roof. And people believe that having them figure out the error themselves for trivial stuff is somehow a sane way to operate.

They're basically saying: "Oh no, it's not a problem that it tells me this error message is because of a dependency mismatch between two libraries when it's actually a logic error, because in the end, after x passes where it says it's actually that other thing --oh wait, no, that fourth thing-- it'll eventually figure out the error and correct it."

"Because it's agentic", so it's oh-so-intelligent.

When it's actually trying the most completely dumbfucktarded things in the most crazy way possible to solve issues.

I won't get started on me pasting a test case showing that the code it wrote is failing, only for it to answer: "Oh, but that's a behavioral problem, not a logic problem." That thing distorts words to avoid losing face. It's wild.

I may cancel my subscription and wait two or three more releases for these models and the tooling around them to get better before jumping back in.

Btw, if they're so good, why is the tooling so sucky? How come they haven't yet written amazing tooling to deal with all their idiosyncrasies?

We're literally talking about TFA, which wrote "Unicode characters that break parsers" (and I've noticed the exact same when trying to debug agentic thinking loops).

That's the level of mediocrity of output from these tools (or the proprietary wrappers around them that we don't control) that we are at atm.

I know, I know: "I'm doing it wrong because I'm not a prompt engineer" and "I'm not agentic enough" and "I don't have enough skills to write skills". But you're only fooling yourself.
