I think this is a true yet short-sighted take. Keep in mind these features are immature, but they exist to build a flywheel and corner the market. I don't know why, but people seem to consistently miss two points and their implications:
- performance is continuing to increase incredibly quickly, even if you rightfully don't trust any particular evaluation. We have scaling laws for this: Chinchilla for pretraining, plus RL scaling laws for both training and test-time compute
- coding is a verifiable domain
The second point is the more important one. Agent quality is NOT limited by the human code in the training set; that code is simply used for efficiency: it gets you to a good starting point for RL.
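To make that concrete, here's a minimal sketch of what "verifiable" buys you: the RL reward comes from executing tests, not from imitating human code. This is a toy illustration, not any lab's actual pipeline; `reward_from_tests` is a hypothetical verifier and the policy/update machinery is elided.

```python
# Toy sketch: test execution as an RL reward signal for code.
import os
import subprocess
import sys
import tempfile

def reward_from_tests(candidate_code: str, test_code: str) -> float:
    """Run a candidate solution against tests; reward 1.0 on pass, else 0.0."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "solution.py")
        with open(path, "w") as f:
            f.write(candidate_code + "\n" + test_code + "\n")
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0

# Human code pretraining only supplies the starting policy; from here the
# training signal is the verifier, so quality isn't capped by the corpus.
candidate = "def add(a, b):\n    return a + b"
tests = "assert add(1, 1) == 2\nassert add(-1, 1) == 0"
print(reward_from_tests(candidate, tests))  # 1.0
```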
Claiming that these systems will not reach superhuman performance, INCLUDING on end-to-end tasks (understanding a poorly articulated business objective, architecting a system, building it out, testing it, maintaining it, fixing bugs, adding features, refactoring, etc.), is the claim that carries the burden of proof, because we literally can predict performance (albeit with a complicated relationship between benchmarks and real-world performance).
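For concreteness on "we can predict performance": the Chinchilla fit from Hoffmann et al. (2022) predicts pretraining loss from parameter count N and token count D. The coefficients below are their published estimates; treat this as illustrative, not a claim about any current frontier model.

```python
# Chinchilla-style scaling law: predicted pretraining loss L(N, D).
def chinchilla_loss(N: float, D: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / N**alpha + B / D**beta

# e.g. a 70B-parameter model trained on 1.4T tokens (Chinchilla itself):
print(chinchilla_loss(70e9, 1.4e12))  # ~1.94
```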
Yes, definitely, error rates are still too high for this to be trusted fully end to end, but they are improving consistently, and that improvement is what drives the METR time-horizon results.
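One way to see the link between error rates and time horizons (a toy independence model, not METR's actual methodology): if an agent succeeds at each atomic step with probability p, the longest task it finishes with 50% reliability scales like ln(2)/(1 - p), so steadily falling error rates produce exponential-looking horizon growth.

```python
# Toy model: per-step success rate p -> 50%-reliability task horizon.
import math

def horizon_steps(p: float, target: float = 0.5) -> float:
    """Largest n with p**n >= target, i.e. n = ln(target) / ln(p)."""
    return math.log(target) / math.log(p)

for p in (0.90, 0.95, 0.99, 0.999):
    print(f"per-step success {p:.3f} -> horizon ~{horizon_steps(p):.0f} steps")
# Roughly: halving the per-step error rate doubles the horizon.
```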
> - coding is a verifiable domain
You're missing the point though. "1 + 1" vs "one.add(1)" might both be "passable" and correct, but that's missing the forest for the trees: how do you know which one is the right choice long term, given what we know? That is the engineering part of building software, as opposed to the "coding", which tends to be the easy part.
How do you evaluate, score, and/or benchmark something like that? Currently, I don't think we have any methodology for it, probably because it's pretty subjective in the end. That's where the "creative" parts of software engineering become more important, and they're also way harder to verify.
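To sharpen that with a toy example (hypothetical code): two implementations can be indistinguishable to any test-based verifier while differing in exactly the dimension that matters for engineering.

```python
# Both versions pass the same tests, so a test-based reward cannot tell
# which is the better long-term engineering choice.
def total_price_v1(items):
    return sum(i["price"] for i in items) * 1.08  # tax rate baked in

TAX_RATE = 0.08  # named and centralized; same behavior, easier to change

def total_price_v2(items):
    return sum(i["price"] for i in items) * (1 + TAX_RATE)

items = [{"price": 10.0}, {"price": 5.0}]
# The "verifier" sees identical behavior:
assert abs(total_price_v1(items) - total_price_v2(items)) < 1e-9
```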
But the issue isn't coding, it's doing the right thing. I don't see anywhere in your plan a way of staying aligned with core business strategy, exercising forethought, etc.
The number of devs will shrink, but there will still be large swaths of activity that can't be farmed out without an overall strategy.
Scaling laws vs. combinatorial explosion: who wins? In my personal experience, Claude does exceedingly well on mundane code (do a migration, add a field, wire up this UI) and quite poorly on code that has likely never been written before (even if it is logically simple for a human). The question is whether this is a quantitative or a qualitative barrier.
Of course it's still valuable. A real app has plenty of mundane code despite our field's best efforts.