> Can they write production software without careful and close human supervision? Not yet. That's not disparagement, just an observation of where we are today.
I never claimed they could! I just view this as a successful experiment. I don't think anthropic was making that claim with their experiment either.
It feels reflexive to the moment to argue against that claim, but I tend to operate with a bit more nuance than "all good" or "all bad".
I think people are concerned about the large discrepancy in concrete claims in your previous comment and subsequent empirical information. You may have seen a headline or skimmed an article and missed some details, not a big deal.
The overall impression given was inaccurate and the implicit claim of a fully working end-to-end generated compiler was inaccurate. The headlines were incomplete in a way that was intentionally misleading. It was an interesting experiment and somewhat impressive but the claims were overblown. It happens.
The experiment failed to produce a workable C compiler despite 1. the job not being particularly hard, 2. the available specs and tests are of a completely higher class of quality than almost any software, not to mention the availability of other implementations that the model trained on.
You can call that a success (as it did something impresssive even though it failed to produce a workable C compiler) but my point in bringing this up was to show that today's models are not yet able to produce production software without close supervision, even when uncharacteristically good specs and hand-written tests exist.