You may want to look over this thread from cperciva: https://x.com/cperciva/status/2029645027358495156
I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.
I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.
Please don't post links with tracking parameters (t=jQb...).
What’s the connection with context size in that thread? It seems more like an instruction following problem.