Math proofs are really easy to run with this specific harness. Our next experiments are going to be bigger: think full codebase refactors. We're working on applying RLM to improve context window limits so we can keep more of the actual code in RAM.
Any workloads you want to see? The best ones have a way to measure whether the output is successful. We're thinking about recreating the C compiler example Anthropic did, but doing it for less than the $20k in tokens they used.
Maybe I'm just not working on complex or big enough projects, but I haven't encountered a feature that couldn't be implemented in one or two context windows, or, using vanilla Claude Code, with a multi-phase plan doc, a couple of sub-agents, and a final verification pass with Codex.
I guess maybe I'm doing the orchestration manually, but I always find there are tons of decisions that need to be made in the middle of large plan implementations.
Your refactor example terrifies me because the best part of a refactor is cleaning out all the band-aid workarounds and obsolete business logic you didn't even know existed. I can't see how an agent swarm would be able to figure that out unless you provide a giga-spec file containing all current business knowledge. And if you don't spec it, the agents will just eagerly bake these inefficiencies and problems into your migrated app.