what am i missing? you take a spec and create tests, every little thing you use another ai to ve...

franze • today at 2:31 PM • 7 replies • view on HN

what am i missing?

you take a spec and create tests, every little thing

you use another ai to verify these tests against the spec

you review the tests vs the spec (at one point human review)

you put the tests off limits to change / wall them.

you let the ai write the software that fulfills the tests.

there will be some gaps where you repeat the cycle above

if the tests fulfill the spec, the code will fulfill the spec

Replies

torben-friis • today at 2:51 PM

>you take a spec and create tests, every little thing

A spec detailed enough and unambiguous enough to be translated into machine execution deterministically is called code.

Unlike a compiler, AI can build with a spec that is not detailed enough or unambiguous enough: It does so by filling in the gaps with educated guesses.

This is safe if and only if you take the time to later read the output, understand what its guesses were, and judge wether they were acceptable. No AI can do this for you because the truth lies in your original intentions, which it does not have access to.

The jury is out there on how reliable and time consuming this is vs writing the code yourself; it is not immediately obvious that is faster or requires a smaller cognitive load.

➕ show 1 reply

steveBK123 • today at 2:35 PM

If each step requires micro-steps iterating with an LLM with human review to prevent hallucinations creeping in.. at some point you might just be better off letting the human do the work.

Particularly as tokenmaxxing has ended and people are being charged more economic prices. If the pricing 5-10x the way Uber,etc did on the path to profitability.. even more so.

➕ show 1 reply

officialchicken • today at 2:36 PM

IME, regulatory compliance is something you are rarely able to test for in a nice little box or with well-known suite. So there's no easy "this complies" in many situations, no matter how many lawyers, compliance officers, and llm's you run it past.

➕ show 1 reply

hedora • today at 3:58 PM

I walked down that path for a few months. The more you constrain LLM's, the more underhanded they behave in order to produce something that satisfies all the constraints.

Doing the above doesn't actually make the model smarter, so, if it couldn't get to correct code with fewer steps, then the light you see at the end of the tunnel is an oncoming train.

sigbottle • today at 2:45 PM

This is such an abstract principle that the principle itself cannot be refuted. The plan sounds fine on paper. "Just iterate bro". But it entirely depends on what rational agents you put into the system. Obviously, if I sub in a 5 year old child everywhere, this loop breaks. Humans and AI, sometimes one is better than the other at certain things, we're still learning.

The only way to test this is to test it out, in real life. Sometimes people see results, sometimes people don't. Note that yes, I am including the entire iteration process - even after iterating, people still don't see results with AI.

I have had both positive and negative experiences with AI, over multi-week projects. But apparently on hackernews, anything positive about AI is proof that AI is superhuman and taking over, and all follies about AI are lies by stupid humans who secretly have psychological dispositions to fear AI. Sometimes the AI genuinely isn't good enough. Are we not allowed to say that now? We might not know why, but it's just the truth.

The other solution is to formally analyze the entire space of possible actions the agent can take a priori. Then yes, you can definitively say whether or not the principle breaks or not. Can you, though? Can you give a formal specification for the space of possible actions for AI and show that your loop never breaks, or breaks less than humans, or any other sensible criteria? If not, then you can't just give an abstract principle and start making inferences from that.

bobkb • today at 4:18 PM

It’s impossible to write a spec that’s not ambiguous , complete and correct in natural languages. Thus prompts will always generate unreliable software.

alt Hacker News

Replies