
noodletheworld, today at 7:56 AM

> what's much better is having THE AGENT come up with end to end test scenarios

There is no difference between an agent writing Playwright tests and writing unit tests.

End-to-end tests ARE TESTS.

You can call them 'scenarios', but... *waves arms wildly in the air like a crazy person* those are tests. They're tests. They assert behavior. That's what a test is.

It's a test.

Your 'levels of accuracy' are:

1. <-- no tests
2. <-- llm critic multi-pass on generated output
3. <-- the agent uses non-model tooling (lint, compilers) to self correct
4. <-- the agent writes tests
5. <-- the agent writes end-to-end tests
6. <-- a human does the testing

Now, all of these are totally irrelevant to your point other than 4 and 5.

> I can empirically show...

Then show it.

I don't believe you can demonstrate a meaningful difference between (4) and (5).

I have not misunderstood your point.

There is no meaningful difference between having an agent write 'scenario' end-to-end tests, and writing unit tests.

It doesn't matter if the scenario tests are in Cypress, or Playwright, or just a text file that you give to an LLM with a browser MCP.

It's a test. It's written by an agent.

/shrug


Replies

simianwords, today at 8:15 AM

> Now, all of these are totally irrelevant to your point other than 4 and 5.

No, they are completely relevant.

I don't have empirical proof for 4 -> 5, but I assume you agree that there is a meaningful difference between 1 -> 4?

Do you disagree that an agent that simply writes code and uses a linter tool + unit tests is meaningfully different from an LLM that uses those tools but also uses the end product as a human would?

In your previous example

> Well, it could go, 'this is stupid, X-Country is not a thing, this feature is not implemented correctly'.

> ...but, it's far more likely it'll go 'I tried this with X-Country: America, and X-Country: Ukraine and no X-Country header and the feature is working as expected'.

I could easily disprove this. But let me ask you: what's the best way to disprove it?

"Well, it could go, 'this is stupid, X-Country is not a thing, this feature is not implemented correctly'"

In an end-to-end test, the agent would send the X-Country header for those blocked countries and see for itself that the requests were not really blocked. Do you think the LLM cannot handle this workflow, and that it would hallucinate even this simple thing?
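
To make that workflow concrete, here is a minimal sketch in Python of the kind of check being described. Everything in it is an assumption for illustration: the tiny stand-in server, the `BLOCKED` set, the 403 status for blocked requests, and the helper names `status_for` and `run_scenario` are all hypothetical, and nothing in the thread specifies how the real feature responds. It only shows the shape of "send the header, observe what actually comes back".

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical blocklist; the thread does not say which countries are blocked.
BLOCKED = {"Ukraine"}


class Handler(BaseHTTPRequestHandler):
    """Stand-in for the service under test (an assumption, not the real app)."""

    def do_GET(self):
        country = self.headers.get("X-Country")
        # Assumed behavior: 403 for blocked countries, 200 otherwise.
        self.send_response(403 if country in BLOCKED else 200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the output quiet


def status_for(base_url, country=None):
    """Send one GET, optionally with an X-Country header, and return the status."""
    request = urllib.request.Request(base_url)
    if country is not None:
        request.add_header("X-Country", country)
    # Bypass any proxy configured in the environment for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def run_scenario():
    """Start the stand-in server, run the three requests, return the statuses."""
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = f"http://127.0.0.1:{server.server_port}/"
    try:
        return {
            "blocked": status_for(base_url, "Ukraine"),
            "allowed": status_for(base_url, "America"),
            "no_header": status_for(base_url),
        }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_scenario())
```

If the block were not actually enforced, the "blocked" entry would come back as something other than the assumed 403, which is the signal the scenario is looking for. Whether an agent would choose these cases on its own is the part still under dispute here.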
