Single-agent LLMs suck at long-running complex tasks.
We’ve open-sourced a multi-agent orchestrator that we’ve been using to handle long-running LLM tasks. We found that single LLM agents tend to stall, loop, or generate non-compiling code, so we built a harness for agents to coordinate over shared context while work is in progress.
How it works:

1. An orchestrator agent manages task decomposition
2. Sub-agents do parallel work
3. Subscriptions to task state and progress
4. Real-time sharing of intermediate discoveries between agents
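Here's the rough shape as a minimal Python sketch, not the skill's actual code (all the names here are made up):

```python
# Minimal sketch of the pattern above, not the skill's actual code.
# All names (TaskBoard, run_subagent, orchestrate) are made up.
import asyncio

class TaskBoard:
    """Shared state: progress plus intermediate discoveries."""
    def __init__(self):
        self.discoveries = []   # (agent_id, finding) pairs
        self.subscribers = []   # callbacks watching task state

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, agent_id, finding):
        # Push each discovery to every subscriber in real time
        self.discoveries.append((agent_id, finding))
        for cb in self.subscribers:
            cb(agent_id, finding)

async def run_subagent(agent_id, subtask, board):
    # Stand-in for an LLM call scoped to one subtask; a real sub-agent
    # would read board.discoveries to avoid repeating failed strategies.
    await asyncio.sleep(0)
    board.publish(agent_id, f"finished: {subtask}")

async def orchestrate(task):
    board = TaskBoard()
    board.subscribe(lambda a, f: print(f"[agent {a}] {f}"))
    subtasks = [f"{task} / part {i}" for i in range(3)]  # decomposition
    await asyncio.gather(*(run_subagent(i, s, board)
                           for i, s in enumerate(subtasks)))

asyncio.run(orchestrate("refactor module"))
```

Each sub-agent only ever sees its own subtask plus the shared board, which is what keeps individual context windows small.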
We tested this on a Putnam-level math problem, but the pattern generalizes to things like refactors, app builds, and long research. It’s packaged as a Claude Code skill and designed to be small, readable, and modifiable.
Use it, break it, and tell me what workloads we should try running next!
I wonder if there’s a third camp that isn’t about agent count at all, but about decision boundaries.
At some point the interesting question isn’t whether one agent or twenty agents can coordinate better, but which decisions we’re comfortable fully delegating versus which ones feel like they need a human checkpoint.
Multi-agent systems solve coordination and memory scaling, but they also make it easier to move further away from direct human oversight. I’m curious how people here think about where that boundary should sit — especially for tasks that have real downstream consequences.
Great work! I like the approach of maximum freedom inside a bounded blast radius, and how you use code to encode policy.
It's more of a black box with plain Claude; at least with this you can see the proof strategy and the mistakes the model makes when it decomposes the problem. Instead of Ralph looping, I think you get something top-down. If models were smarter and context windows bigger, I'm sure complex tasks like this one would be simpler, but breaking the task down into sub-agents, with a collective "we already tried this strategy and it backtracked" intelligence, is a nice way to scope a limited context window to an independent sub-problem.
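Something like a shared attempt log is what I have in mind; a hypothetical sketch (nothing here is from the actual project):

```python
# Hypothetical sketch of that shared "already tried" memory; none of
# these names come from the actual project.
class AttemptLog:
    def __init__(self):
        self._failed = {}   # strategy -> reason it backtracked

    def record_failure(self, strategy, reason):
        self._failed[strategy] = reason

    def already_tried(self, strategy):
        return strategy in self._failed

log = AttemptLog()
log.record_failure("induction on n", "base case fails for n = 0")
if not log.already_tried("generating functions"):
    pass  # hand this strategy to a fresh sub-agent with a clean context
```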
The first screen of your signup flow asks for "organization". Is that used as a username, as an organization name, or both? (I can't tell what, if anything, will be on the next screen.)
If your registration process is eventually going to ask me for a username, can the org name and username be the same?
What’s the failure mode you see with single-agent Claude Code on complex tasks? (looping, context drift, plan collapse, tool misuse?)
Can you add a license.txt file so we know we have permission to run this? (e.g., MIT and GPLv3 are very different)
Cool, what’s a good first task to try this on where it’s likely to beat a single agent?
Seems like it requires an API key for your proprietary Ensue memory system.
How does progress subscription work? Are agents watching specific signals (test failures, the TODO list, build status), or just a global feed?
I feel like there are two camps:

* Throw more agents at it
* Use something like Beads

I'm in the latter camp. I don't have infinite resources, so I'd rather stick to one agent and optimize what it can do. When I hit my Claude Code limit, I stop; I use Claude Code primarily for side projects.