Hacker News

simianwords today at 8:15 AM

> Now, all of these are totally irrelevant to your point other than 4 and 5.

No, they are completely relevant.

I don't have empirical proof for 4 -> 5, but I assume you agree there is a meaningful difference between 1 -> 4?

Do you disagree that an agent that simply writes code and uses a linter tool + unit tests is meaningfully different from an LLM that uses those tools but also uses the end product as a human would?

In your previous example:

> Well, it could go, 'this is stupid, X-Country is not a thing, this feature is not implemented correctly'.

> ...but it's far more likely it'll go 'I tried this with X-Country: America, and X-Country: Ukraine, and no X-Country header, and the feature is working as expected'.

I could easily disprove this. But let me ask you: what's the best way to disprove it?

"Well, it could go, 'this is stupid, X-Country is not a thing, this feature is not implemented correctly'"

How this would work in an end-to-end test is that the agent would send the X-Country header for those blocked countries and verify whether the request was really blocked or not. Do you think the LLM cannot handle this workflow, and that it would hallucinate even something this simple?
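
Concretely, a minimal sketch of the kind of check I mean is below. Everything in it (the endpoint, the header values, the status codes) is a placeholder I made up for illustration; the point is only that the check is mechanical, not something the agent has to imagine:

    # Hypothetical end-to-end check for the X-Country example above.
    # The endpoint, country values, and status codes are placeholders.
    import requests

    BASE_URL = "http://localhost:8080/some-feature"  # placeholder endpoint
    BLOCKED = ["SomeBlockedCountry"]                  # placeholder blocklist
    ALLOWED = ["SomeAllowedCountry"]

    def test_blocked_countries_are_rejected():
        for country in BLOCKED:
            resp = requests.get(BASE_URL, headers={"X-Country": country})
            # If the server never actually reads X-Country, this assertion
            # fails loudly instead of the agent reporting "working as expected".
            assert resp.status_code == 403, f"{country} was not blocked"

    def test_allowed_countries_pass_through():
        for country in ALLOWED:
            resp = requests.get(BASE_URL, headers={"X-Country": country})
            assert resp.status_code == 200, f"{country} was unexpectedly blocked"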


Replies

noodletheworld today at 8:28 AM

> it would send the X-Country header for those blocked countries and it verifies that the feature was not really blocked.

There is no reason to presume that the agent would successfully do this.

You haven't tried it. You don't know. I haven't either, but I can guarantee it would fail at least some of the time; that much is provable. Agents fail at tasks from time to time. They are non-deterministic.

If they never failed, we wouldn't need tests <------- !!!!!!

That's the whole point. Agents, RIGHT NOW, can generate code, but verifying that what they have created is correct is an unsolved problem.

You have not solved it.

All you are doing is taking one LLM, pointing it at the output of a second LLM, and saying 'check this'.

That is step 2 on your accuracy list.

> Do you disagree that an agent that simply writes code and uses a linter tool + unit tests is meaningfully different from an LLM that uses those tools but also uses the end product as a human would?

I don't care about this argument. You keep trying to bring in irrelevant side points; I'm not playing that game.

You said:

> I can empirically show you that this spectrum works as such.

And:

> I don't have empirical proof for 4 -> 5

I'm not playing this game.

What you are, overall, asserting is that END-TO-END tests written by agents are reliable.

-

They. are. not.

-

You're not correct, but you're welcome to believe you are.

All I can say is, the burden of proof is on you.

Prove it to everyone by doing it.