Hacker News

wg0 · yesterday at 10:45 PM

Who's going to review that output for accuracy? We'll leave performance and security aside as unnecessary luxuries in this day and age.

In my experience, even Claude 4.6's output can't be trusted blindly: it'll write flawed code, then write tests that test that flawed code, giving a false sense of confidence and accomplishment, only to be revealed on closer inspection later.
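To make that failure mode concrete, here's a hypothetical sketch (function and values invented): an off-by-one bug, plus a generated test that simply mirrors the buggy output instead of the spec, so the suite passes while the code is wrong.

```python
# Hypothetical illustration: a buggy function plus a test written
# *against* the buggy output rather than against the spec.

def sum_first_n(values, n):
    """Intended: sum the first n items. Bug: off-by-one, sums n-1 items."""
    return sum(values[: n - 1])

def test_sum_first_n():
    # A generated test can encode the bug as the expected value.
    # It passes, proving nothing. The spec says this should be 6.
    assert sum_first_n([1, 2, 3, 4], 3) == 3

test_sum_first_n()
print("all tests passed")  # false sense of confidence
```

Only a reviewer comparing the assertion to the actual requirement catches this; the green test run doesn't.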

Additionally, it's an age-old fact (even prior to AI) that code is easy to write but tenfold more difficult to read and understand, even if you were the original author yourself. So I'm not so sure this much generative output from probabilistic models is so flawless that nobody needs to read and understand that code.

Too good to be true.


Replies

aenis · today at 8:47 AM

I am not sure how others are doing this, but here is our process:

- meaningful test coverage

- internal software architecture was explicitly baked into the prompts, and we try not to go wild with vibing, but rather spec it well and keep Claude on a short leash

- each feature built was followed by a round of refactoring (with Claude, but with the oversight of an opinionated human). We spend 50% of the time building and 50% refactoring, at least; sometimes it feels like 30/70. Code quality matters to us: these codebases are large, and skipping this leads to a very noticeable drop in Claude's perceived 'intelligence'.

- performance tests as per usual - designed by our infra engineers, not vibed

- static code analysis, and a hierarchical system of guardrails (a small claude.md plus lots of files referenced there for various purposes). Not quite fond of how that works; Claude has always been very keen to ignore instructions and go his own way (see: "short leash, refactor often").

- pentests with regular human beings
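As an illustration of the hierarchical-guardrails idea described above (the file names here are invented, not from the comment), a deliberately small top-level claude.md that mostly points at purpose-specific files might look like:

```markdown
# claude.md (kept deliberately small)
Read and follow the referenced files before making changes.

- Architecture and module boundaries: see docs/architecture.md
- Coding conventions and refactoring rules: see docs/style.md
- Security checklist (input validation, secrets handling): see docs/security.md
- Testing policy (what must be covered before a change lands): see docs/testing.md
```

The point of the hierarchy is that the root file stays short enough to survive in context, while the referenced files carry the detail.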

The one project I mentioned - 2 months for a complete rewrite - was about a week of working on the code and almost 2 months spent on reviews and tests; of course, some of that time was wasted, as we were doing this for the first time on such a large codebase. The rewritten app has been doing fine in production for a while now.

I can only compare the outputs to the quality of the outputs of our regular engineering teams. It compares fine vs. good dev teams, IMHO.

doh · yesterday at 11:46 PM

I don't want to defend LLM-written code, but this is true regardless of whether code is written by a person or a machine. There are engineers who will put in the time to learn and optimize their code for performance and focus on security, and there are others who won't. That has nothing to do with AI writing code. There's a reason most software is so buggy and all software has identified security vulnerabilities, regardless of who wrote it.

I remember what website security was like before frameworks like Django and RoR added default security features. I think we'll see something similar with coding agents: they'll just run skills/checks/MCPs/... that have performance, security, resource management, ... built in.

I have done this myself. For all the apps I build, I have linters, static code analyzers, etc. running at the end of each session, using the cheapest defaults in a very strict mode. It cleans up most of the obvious stuff almost for free.
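A minimal sketch of such an end-of-session pass (tool choices are assumptions for illustration; `ruff` and `mypy` stand in for whatever linter/analyzer your stack uses). It only invokes tools that are actually installed:

```python
import shutil
import subprocess

# Tools assumed for illustration; substitute your own stack's equivalents.
CHECKS = [
    ["ruff", "check", ".", "--fix"],  # linter in strict mode, auto-fixes trivia
    ["mypy", "--strict", "."],        # static type analysis
]

def run_checks(checks=CHECKS):
    """Run each check whose binary is installed; report a per-tool status."""
    results = {}
    for cmd in checks:
        if shutil.which(cmd[0]) is None:
            results[cmd[0]] = "skipped (not installed)"
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[cmd[0]] = "ok" if proc.returncode == 0 else "issues found"
    return results

if __name__ == "__main__":
    for tool, status in run_checks().items():
        print(f"{tool}: {status}")
```

Wiring something like this into the end of every agent session is what makes the cleanup "almost for free": the agent (or a human) only looks at the residue the tools can't fix automatically.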

abustamam · yesterday at 11:32 PM

Well, it's all tradeoffs, right? 6 months for 9 FTEs is 54 man-months. 2 months for 2 FTEs is 4 man-months. Even if one FTE spent two extra months perusing and reviewing every line of code, that's still 6 man-months, resulting in almost 10x speed.

Let's say you don't review. Those two extra months probably turn into four extra months of finding bugs and such. That's still 8 man-months vs. 54.

Of course, this all assumes the original estimates were correct. IME, building stuff with AI on greenfield projects is gold, but on brownfield projects AI is only useful if you primarily use it to chat with your codebase and to make specific, scoped changes, not to make large changes.

vorticalbox · today at 7:00 AM

You write the tests first; then it has a source of truth for knowing when it's not working.
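A hedged sketch of that workflow (names and behaviour invented for illustration): the human-written assertions are the spec, and a model-generated implementation is only accepted once they pass.

```python
import re

# Human-written tests act as the source of truth / spec.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  trim me  ") == "trim-me"
    assert slugify("A--B") == "a-b"

# Model-generated implementation, accepted only once the spec passes.
def slugify(text: str) -> str:
    """Lowercase, trim, and collapse runs of non-alphanumerics to one hyphen."""
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")

test_slugify()
```

The crucial difference from the failure mode criticized upthread is direction: here the tests come from the human's intent, so a passing run actually means something.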

yladiz · yesterday at 11:48 PM

Minor point: AI doesn’t write, it generates.