LLMs are at their best when you have an expectation for their output. I generally know the shape of the correct response, and that lets me evaluate the output on its "vibes" rather than line by line. If there's no expectation, then I have to take everything at face value, and now I'm at the mercy of the machine.
I agree, but I would add that they can be very useful even if you do not have clear expectations, as long as you have some solid ways to verify their claims. Often, in doing this verification, I come up with new ideas.
Exactly. If I generate a large chunk of software, I'm going to have expectations about what it will do, how it will do it, etc. You don't just accept the statement that "it's done" as fact; you start looking for evidence.
A scientific approach here is to try to falsify the statement. You start asking questions and running tests and experiments to prove wrong the notion that it is done. At some point you run out of such tests, and it's probably done, for some useful notion of done-ness.
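Concretely, that falsification pass can just be a handful of adversarial tests. A minimal sketch in Python, where `slugify()` stands in for whatever the model claimed was done (the function and the expected behaviors here are made up for illustration, not anyone's real code):

```python
import re

def slugify(text: str) -> str:
    # Stand-in for the LLM-generated code under test.
    text = text.lower().strip()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")

def test_happy_path():
    assert slugify("Hello World") == "hello-world"

def test_empty_input():
    # Falsification attempt: does "done" survive the empty string?
    assert slugify("") == ""

def test_unicode_input():
    # Falsification attempt: accented characters get mangled into
    # separators -- is that actually the behavior we wanted?
    assert slugify("Crème Brûlée") == "cr-me-br-l-e"

def test_idempotence():
    # Property check: slugifying twice should change nothing.
    s = slugify("Some -- Messy __ Title")
    assert slugify(s) == s
```

The point is the direction of the tests: each one is chosen to break the "done" claim, not to confirm the happy path.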
I've built some larger components and things with AI. It's never a one-shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process almost runs itself. Mostly I just nudge it along: "Did you think about X? What about Y? Let's test Z."
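That nudging loop is not much code either. A rough sketch, assuming a `complete(prompt)` callable that wraps whatever LLM client you use; the nudge prompts are invented for illustration:

```python
from typing import Callable

# Canned reviewer nudges -- pure illustration, write your own.
NUDGES = [
    "Did you think about concurrent writers?",         # "Did you think about X?"
    "What about inputs larger than memory?",           # "What about Y?"
    "Write and run a test for the empty-input case.",  # "Let's test Z"
]

def review_loop(code: str, complete: Callable[[str], str]) -> str:
    """Feed the model's 'done' code back to it under each nudge.

    `complete` is a hypothetical wrapper around your actual LLM client:
    it takes a prompt string and returns the model's text response.
    """
    for nudge in NUDGES:
        code = complete(
            "Here is code you claimed was done:\n"
            f"{code}\n\n{nudge}\n"
            "Revise the code if this exposes a gap; "
            "otherwise return it unchanged."
        )
    return code
```

Each pass hands the model its own "done" claim plus one nudge, so the evaluation work stays mostly on the machine's side and you only step in when a nudge surfaces something real.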