> Furthermore, we observe that input tokens consistently constitute the largest share of consumption for an average of 53.9%
I'm seeing a ratio of around 10:1 in my usage. A vast majority of the tokens consumed are on the input side. The agent will often read a million tokens just to patch one line of code.
I think if you are seeing something closer to 1:1 or more on the output side, there is either a problem with the agent or the codebase is new / empty.
Did you experiment with giving the agent better tools to navigate and document the codebase? ASTs, language servers and so on?
A million tokens (not cached) sounds like a lot.
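To make the AST idea concrete, here is a minimal sketch of a navigation tool that hands the agent a compact outline of a file instead of its full text. It uses only Python's standard `ast` module; the function name `outline` and the output format are my own illustration, not something from the paper or from any particular agent framework.

```python
import ast


def outline(source: str) -> list[str]:
    """List every class and function in `source` with its line number
    and the first line of its docstring (if any), in source order."""
    tree = ast.parse(source)
    entries = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "def"
            doc = ast.get_docstring(node) or ""
            summary = doc.splitlines()[0] if doc else ""
            text = f"{kind} {node.name} (line {node.lineno}) {summary}".rstrip()
            entries.append((node.lineno, text))
    # ast.walk is breadth-first, so sort by line number to restore source order.
    return [text for _, text in sorted(entries)]
```

The agent can then request only the line ranges it actually needs, which should cost far fewer input tokens than reading whole files.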
If input tokens dominate the cost to that extent, this implies that major gains are possible by making better use of caching. You could basically ask the model to do a one-time "compaction" step including a dump of the relevant portions of the code, and use that as the cached prefix for a large number of "swarm" subagent calls.
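A rough back-of-the-envelope sketch of why a shared cached prefix could matter. Everything here is an assumption for illustration: the function name, the token counts, the price, and the cache write/read multipliers are placeholders, not any provider's actual pricing, and it assumes the cache stays warm across all calls.

```python
def swarm_input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    calls: int,
    price_per_mtok: float,
    cache_write_mult: float,
    cache_read_mult: float,
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) for the input side of `calls`
    subagent requests that share the same prefix.

    Uncached: every call pays full price for prefix + suffix.
    Cached: the first call writes the prefix to the cache (at a multiplier),
    later calls read it (at a discounted multiplier); suffixes are always
    billed at full price.
    """
    per_token = price_per_mtok / 1_000_000
    uncached = calls * (prefix_tokens + suffix_tokens) * per_token
    cached_tokens = (
        prefix_tokens * cache_write_mult
        + (calls - 1) * prefix_tokens * cache_read_mult
        + calls * suffix_tokens
    )
    return uncached, cached_tokens * per_token


# Hypothetical numbers: 200k-token compacted prefix, 2k-token task per subagent,
# 50 subagents, $3 per million input tokens, 1.25x cache write, 0.1x cache read.
uncached, cached = swarm_input_cost(200_000, 2_000, 50, 3.0, 1.25, 0.1)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

With those made-up inputs the input bill drops from about $30.30 to about $3.99. The real ratio depends on how often the cache expires and on how much of each call is truly shared.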