All of the above. The most frustrating one with the Putnam example with Claude was generating solutions that obviously didn't compile. This feels like plan collapse- not verifying its own work. I'm sure that if you just had a dumb two-model setup, it would eventually get to compiling code after n runs, but that was just for this one failure mode.
All of the above. The most frustrating one with the Putnam example with Claude was generating solutions that obviously didn't compile. This feels like plan collapse- not verifying its own work. I'm sure that if you just had a dumb two-model setup, it would eventually get to compiling code after n runs, but that was just for this one failure mode.