Hacker News

christinetyip, yesterday at 4:49 PM

Cool, what’s a good first task to try this on where it’s likely to beat a single agent?


Replies

austinbaggio, yesterday at 5:06 PM

Math proofs are really easy to run with this specific harness. Our next experiments are going to be bigger: think full codebase refactors. We're working on applying RLM to improve context window limits so we can keep more of the actual code in RAM.

Any workloads you want to see? The best ones have a way to measure whether the output is successful. We're thinking about recreating the C compiler example Anthropic did, but doing it for less than the $20k in tokens they used.

miligauss, yesterday at 5:06 PM

We tried Putnam A2.