Nice work once again from Ofir Press and team; this seems to be an idea that's in the air.
> Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task
Fwiw, this is very different from what we find in MirrorCode:
> Opus 4.6 successfully reimplements almost every program up to gotree’s size in our benchmark.
https://epoch.ai/blog/mirrorcode-preliminary-results
I don't have time right now to dig into what could explain the difference (I'm working hard on getting the full MirrorCode out as soon as possible). But I suspect that the ProgramBench authors are either under-eliciting the AIs, or their tasks are unfair/impossible given the constraints, or both.
I hope to look more into it after releasing MirrorCode, and write up my conclusions.
The problem with these types of benchmarks is that it's essentially certain the LLM has already been trained on all of that code, so they're all tainted: you don't know whether you're benchmarking recall or actual reasoning.
Same with SWE-bench and others.
Surely the biggest difference is that you guys are mostly testing LLMs on simpler utilities, mostly involving higher-level languages, whereas ProgramBench's tasks are all very complex C programs (and much older programs with much more comprehensive test suites).
E.g. cal is totally routine. I would expect most sophomores to be able to write a perfectly good cal. In fact the only program you tested that comes anywhere close to the complexity of SQLite or FFmpeg is Pkl, and it looks like Opus 4.6 totally failed on it.
I think your results are consistent. You're just measuring different things. Your benchmark mostly tests LLMs' ability to write technically routine programs of moderate length - yes, the bioinformatics package involves specialized domain knowledge, but not specialized Go engineering. ProgramBench is harder.
I would love to try this out. I have a horrible legacy project written in Angular by a really amateur developer, full of huge blocks of copy-pasted code with minor modifications in each block. I've tried before to get an LLM to rewrite it into something more sensible, but I have not succeeded; usually it just ends up breaking everything. Is there a guide or some system to follow? What's the best way to accomplish a task like this?