>This is no different from having an LLM pair where the first does something and the second one reviews it to “make sure no hallucinations”.
Absolutely not! This means you have not understood the point at all. The rest of your comment also suggests this.
Here's the real point: in scenario testing, you are relying on feedback from the environment for the LLM to understand whether the feature was implemented correctly or not.
This is the spectrum of choices you have, ordered by accuracy:
1. on the base level, you just have an LLM writing the code for the feature
2. only slightly better - you can have another LLM verify the code. This is essentially a second pass, and you're right that it's not much better
3. what's slightly better is having the agent write the code and also give it access to compile commands so that it can get feedback and correct itself (important!)
4. what's even better is having the agent write automated tests and get feedback and correct itself
5. what's much better is having the agent come up with end-to-end test scenarios that directly use the product like a human would. Maybe give it browser access and have it click buttons, and make the LLM use the feedback from there
6. finally, it's best to have a human verify that everything works by replaying the scenario tests manually
I can empirically show you that this spectrum holds: from 1 to 6, accuracy goes up. Do you disagree?
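Levels 3 through 5 above all reduce to the same control loop: generate, run an external verifier, feed the failure output back, repeat. A minimal sketch of that loop, where `write_code` and `run_check` are hypothetical stand-ins for the model call and the environment check (compiler, test suite, or browser scenario):

```python
def verify_and_correct(write_code, run_check, max_rounds=3):
    """Generic self-correction loop.

    write_code(feedback) -- stand-in for the model writing/rewriting code,
                            given the previous failure output (None at first).
    run_check()          -- stand-in for the environment check; returns
                            (ok, output), e.g. a compile or a test run.
    """
    feedback = None
    for _ in range(max_rounds):
        write_code(feedback)      # model produces a new attempt
        ok, output = run_check()  # environment gives ground-truth feedback
        if ok:
            return True
        feedback = output         # failure output becomes the next prompt
    return False
```

Level 3 plugs a compiler in as `run_check`, level 4 plugs in a test runner, level 5 plugs in a browser-driven scenario; the loop itself is identical, which is why the argument is about the quality of the feedback signal, not the loop.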
> what's much better is having THE AGENT come up with end to end test scenarios
There is no difference between an agent writing playwright tests and writing unit tests.
End-to-end tests ARE TESTS.
You can call them 'scenarios', but... waves arms wildly in the air like a crazy person ...those are tests. They're tests. They assert behavior. That's what a test is.
It's a test.
Your 'levels of accuracy' are:
1. no tests
2. an LLM critic doing a multi-pass on generated output
3. the agent uses non-model tooling (lint, compilers) to self-correct
4. the agent writes tests
5. the agent writes end-to-end tests
6. a human does the testing
Now, all of these are totally irrelevant to your point other than 4 and 5.
> I can empirically show...
Then show it.
I don't believe you can demonstrate a meaningful difference between (4) and (5).
I haven't misunderstood your point.
There is no meaningful difference between having an agent write 'scenario' end-to-end tests, and writing unit tests.
It doesn't matter if the scenario tests are in cypress, or playwright, or just a text file that you give to an LLM with a browser MCP.
It's a test. It's written by an agent.
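Put side by side, the two collapse to the same shape: arrange, act, assert. A minimal sketch with a stubbed page object standing in for the browser (the names `FakePage`, `fill`, `click`, `text` are illustrative, not any real Playwright or Cypress API):

```python
def add(a, b):
    return a + b

def test_unit():
    # unit test: call the function directly, assert the result
    assert add(2, 3) == 5

class FakePage:
    """Illustrative stand-in for a browser page object."""
    def __init__(self):
        self.fields, self.result = {}, None
    def fill(self, selector, value):
        self.fields[selector] = value
    def click(self, selector):
        # pretend the button wires the inputs to the same add() under test
        self.result = add(int(self.fields["#a"]), int(self.fields["#b"]))
    def text(self, selector):
        return str(self.result)

def test_scenario():
    # "scenario" test: drive the UI like a user would... then assert
    page = FakePage()
    page.fill("#a", "2")
    page.fill("#b", "3")
    page.click("#add")
    assert page.text("#result") == "5"
```

Different harness, different layer, same structure: an agent wrote some setup, performed an action, and asserted on the outcome. Both are tests.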
/shrug