This is fascinating, and aligned with my experience working on harnesses. I bet there is still significant upside left on the table with that same model.
There was a narrative last year by Anthropic that each new model release had them making the harness closer to a simple while loop with tools, but now it seems to be going in the other direction. There's just so much to explore with harnesses. Rolling context windows (instead of compaction) have been very powerful in my work with agentic harnesses, while keeping a persistent high level summary and a detailed automated feedback pipeline (granted, this is easier said than done if you don't have specific, consistent goals for your agent like we do).