Hacker News

CuriouslyC yesterday at 5:56 PM

Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc., so that looking at the code isn't important, and people have been doing that for >6 months now. You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.

There are higher- and lower-leverage ways to do that, for instance reviewing tests and QA'ing the software by using it rather than reading the original code, but you can't get away from doing it entirely.


Replies

kaicianflone yesterday at 6:16 PM

I agree with this almost completely. The hard part isn't generation anymore, it's validation of intent vs. outcome, especially once decisions are high-stakes or irreversible: think package updates or large-scale transactions.

What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.

Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.
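
Roughly, the shape of that looks something like the sketch below. It's a minimal illustration only; the agent interfaces, reviewer names, and quorum rule are hypothetical, not the actual project.

    from dataclasses import dataclass
    from typing import Callable, List

    # Hypothetical interfaces: each "agent" is just a callable that reviews a
    # change and returns a verdict plus its reasoning, so the sketch runs
    # without any real model behind it.

    @dataclass
    class Verdict:
        reviewer: str
        approve: bool
        reason: str

    def run_review(change: str, reviewers: List[Callable[[str], Verdict]],
                   quorum: float = 0.75) -> dict:
        """Collect independent verdicts, auto-decide only on strong consensus,
        and surface disagreement to a human instead of hiding it."""
        verdicts = [review(change) for review in reviewers]
        agreement = sum(v.approve for v in verdicts) / len(verdicts)

        if agreement >= quorum:
            decision = "auto-approve"
        elif agreement <= 1 - quorum:
            decision = "auto-reject"
        else:
            decision = "escalate-to-human"  # the edge case that actually matters

        return {
            "decision": decision,
            "agreement": agreement,
            "trace": [f"{v.reviewer}: {'approve' if v.approve else 'reject'} ({v.reason})"
                      for v in verdicts],
        }

    # Toy reviewers standing in for independent agents with different incentives.
    def security_reviewer(change: str) -> Verdict:
        risky = "eval(" in change or "pickle.loads" in change
        return Verdict("security", not risky,
                       "dangerous call" if risky else "no obvious injection")

    def style_reviewer(change: str) -> Verdict:
        small = len(change) < 500
        return Verdict("style", small, "small, reviewable diff" if small else "diff too large")

    if __name__ == "__main__":
        result = run_review("def add(a, b):\n    return a + b\n",
                            [security_reviewer, style_reviewer])
        print(result["decision"])
        for line in result["trace"]:
            print(" ", line)

The point of the trace is that a human reviewer only reads the disagreements and the reasoning behind them, not every line of every output.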

Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.

bluesnowmonkey yesterday at 11:26 PM

But is that different from how we already work with humans? Typically we don't let people commit whatever code they want just because they're human. It's more than just code reviews: we have design reviews, sometimes people pair program, there are unit tests and end-to-end tests and all kinds of tests, then code review, continuous integration, QA. We have systems to watch prod for errors, user complaints, or cost/performance problems. We have this whole toolkit of process and techniques to try to get reliable programs out of what you must admit are unreliable programmers.
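
In code terms, those checks and balances end up looking like a merge gate. This is a toy sketch under assumed tooling: the particular checks (pytest, mypy, ruff) are just examples, not any specific team's pipeline.

    import subprocess
    from typing import List, Tuple

    # Hypothetical merge gate: the same pipeline applies whether the diff came
    # from a human or an agent. Each check is a command that must exit 0.
    CHECKS: List[Tuple[str, List[str]]] = [
        ("unit tests", ["pytest", "-q"]),
        ("type check", ["mypy", "src"]),
        ("lint",       ["ruff", "check", "src"]),
    ]

    def gate(changeset: str) -> bool:
        """Return True only if every check passes; otherwise report which
        layer failed so a human looks at that, not the whole diff."""
        for name, cmd in CHECKS:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                print(f"[{changeset}] blocked by {name}:\n{result.stdout or result.stderr}")
                return False
        print(f"[{changeset}] passed all checks, eligible to merge")
        return True

    if __name__ == "__main__":
        gate("agent/feature-branch-123")
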

The question isn't whether agentic coders are perfect. Actually it isn't even whether they're better than humans. It's whether they're a net positive contribution. If you turn them loose in that kind of system, surrounded by checks and balances, does the system tend to accumulate bugs or remove them? Does it converge on high or low quality?

I think the answer as of Opus 4.5 or so is that they're a slight net positive and it converges on quality. You can set up the system and kind of supervise from a distance and they keep things under control. They tend to do the right thing. I think that's what they're saying in this article.

stitched2gethr yesterday at 11:38 PM

This is what we're working on at Speedscale. Our methods use traffic capture and replay to validate that what worked before still works today.
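
The general shape of capture-and-replay validation looks something like the sketch below. It's a minimal illustration only, not Speedscale's actual implementation; the URLs, paths, and recording format are made up.

    import json
    import urllib.request
    from typing import Dict, List

    # Capture phase: record request/response pairs from the known-good version.
    # Replay phase: send the same requests at the new version and diff the answers.

    def capture(base_url: str, paths: List[str], out_file: str) -> None:
        """Record responses from the current (trusted) deployment."""
        recorded = []
        for path in paths:
            with urllib.request.urlopen(base_url + path) as resp:
                recorded.append({"path": path, "status": resp.status,
                                 "body": resp.read().decode()})
        with open(out_file, "w") as f:
            json.dump(recorded, f)

    def replay(base_url: str, in_file: str) -> List[Dict]:
        """Replay the recorded traffic against a new deployment and report drift."""
        with open(in_file) as f:
            recorded = json.load(f)
        mismatches = []
        for entry in recorded:
            with urllib.request.urlopen(base_url + entry["path"]) as resp:
                body = resp.read().decode()
                if resp.status != entry["status"] or body != entry["body"]:
                    mismatches.append({"path": entry["path"],
                                       "expected": entry["body"], "got": body})
        return mismatches

    if __name__ == "__main__":
        # Hypothetical URLs: capture from prod, replay against a candidate build.
        capture("https://api.example.com", ["/health", "/v1/items"], "traffic.json")
        drift = replay("https://staging.example.com", "traffic.json")
        print(f"{len(drift)} responses changed")
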

dimitri-vs yesterday at 11:08 PM

It's simple: you just offload the validation and security testing to the end user.

cronin101 yesterday at 6:02 PM

This obviously depends on what you are trying to achieve, but it's worth mentioning that there are languages designed for formal proofs and static analysis against a spec, and I suspect we are currently underutilizing them (because historically they weren't very fun to write, but if everything is just tokens then who cares).

And "defining the spec concretely" (and learning how to exploit emergent behaviors) becomes the new definition of what programming is.
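
As a taste of what "prove the code against the spec" means in such a language, here is a trivial Lean 4 sketch; Role and canDelete are made-up names, and real specs are far richer than this.

    -- Spec and implementation live side by side; the claim is machine-checked.
    inductive Role | admin | user

    def canDelete : Role → Bool
      | Role.admin => true
      | Role.user  => false

    -- The spec, stated as a theorem: ordinary users can never delete.
    theorem user_cannot_delete : canDelete Role.user = false := rfl

If the implementation changes in a way that breaks the spec, the proof stops compiling, which is exactly the kind of validation that doesn't depend on a human rereading the diff.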

varispeed yesterday at 6:16 PM

AI also quickly goes off the rails, even the Opus 2.6 I am testing today. The proposed code is very much rubbish, but it passes the tests. It wouldn't pass skilled human review. The worst thing is that, if you let it, it will just pile tech debt on top of tech debt.

simianwords yesterday at 6:03 PM

Did you read the article?

>StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).

show 1 reply