
tadamcz · today at 11:23 AM

I don't think so. The ProgramBench authors say no LLM fully resolves any task, i.e. even the easiest tasks in their benchmark are unsolved, whereas we found Opus 4.6 successfully reimplements almost every program up to gotree's size (around 15-20 of them).

For Pkl, the preliminary results only went up to 1 billion total tokens (costing $550, which would be cheap if LLMs could do the task). It might very well be solved at higher token budgets; see the report for more discussion of this.

The preliminary results are just on 4 targets. We have several Pkl-level and harder tasks in the full set which we're releasing soon.

Multiple things in the following quote are not quite right:

> mostly involving higher-level languages, whereas ProgramBench are all very complex C programs (and much older programs with much more comprehensive test cases).

First, as I said above, I think you're confusing the top end of ProgramBench's difficulty with the average. The quote in the OP is pretty clear that FFmpeg, SQLite, and PHP are the 3 hardest out of 200 in ProgramBench, and the bottom end is "compact CLI tools".

Second, I don't see the relevance of C vs. higher-level languages; how does this make ProgramBench harder?

Third, for the test cases: I think you might be labouring under a misapprehension about how MirrorCode works. It uses end-to-end tests from a variety of sources (the original program's test suites, real-world data, and LLM-assisted generation). End-to-end means the stdout/stderr has to match exactly for each test case.
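To make "match exactly" concrete, a check of this kind boils down to a byte-for-byte comparison of a program's output against a reference. This is just a sketch of the idea; `run_e2e_case` and its signature are illustrative, not MirrorCode's actual harness:

```python
import subprocess
import sys

def run_e2e_case(cmd, stdin_data, expected_stdout, expected_stderr):
    """Run one end-to-end test case.

    The reimplementation passes only if its stdout AND stderr match the
    reference output byte for byte. (Illustrative harness, not MirrorCode's.)
    """
    result = subprocess.run(cmd, input=stdin_data, capture_output=True)
    return result.stdout == expected_stdout and result.stderr == expected_stderr

# Example: a trivial program must reproduce the reference output exactly.
ok = run_e2e_case(
    [sys.executable, "-c", "print('hello')"],
    stdin_data=b"",
    expected_stdout=b"hello\n",
    expected_stderr=b"",
)
```

Under this regime there is no partial credit per test case: a single stray byte on either stream fails the case.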