Hacker News

jihadjihad · yesterday at 2:41 PM

I wish there was a little more color in the Testing and QA section. While I agree with this:

  > A comprehensive test suite is by far the most effective way to keep those features working.

there is no mention at all of LLMs' tendency to write tautological tests--tests that pass because they are defined to pass--or tests that are not relevant or useful at all, and are ultimately noise in the codebase, wasting cycles on every CI run. Sometimes, to pass the tests, the model might even hardcode a value in the unit test itself!
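To make that concrete, here is a hypothetical sketch (function names and numbers invented for illustration, not from the article) of the hardcoding pattern next to an assertion that actually checks behavior:

```python
# A hypothetical "tautological" test: the expected value is just the
# function's own output, so the assertion can never fail.

def apply_discount(price, rate):
    return price * (1 - rate)

def test_apply_discount_tautological():
    # Comparing the implementation against itself: if apply_discount
    # has a bug, this test enshrines the bug rather than catching it.
    assert apply_discount(100, 0.2) == apply_discount(100, 0.2)

def test_apply_discount_meaningful():
    # An independently derived expectation actually pins down behavior.
    assert apply_discount(100, 0.2) == 80.0
```

Both tests pass today, but only the second one would fail if the discount logic regressed.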

IMO this section is a great place to show how we as humans can guide the LLM toward a rigorous test suite, rather than one that has a lot of "coverage" but doesn't actually provide sound guarantees about behavior.


Replies

tshaddox · yesterday at 3:39 PM

Do you have an example of the tautological tests you're referring to? What comes to mind is a genuinely, logically tautological test, like "assert(true || expectedResult == actualResult)"--a mistake I don't expect even modern AI coding tools to make. But I suspect you're talking about a subtler type of test, one which at first glance appears useful but actually isn't.
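One hypothetical shape of that subtler kind (invented for illustration): the test derives its expected value using the exact same formula as the code under test, so the two can never disagree, even when the formula itself is wrong:

```python
# A subtler tautology: the "expected" value is computed the same way
# the function computes it, so the assertion mirrors the code.

def total_with_tax(amount, tax_rate):
    return round(amount + amount * tax_rate, 2)

def test_total_with_tax_mirrored():
    amount, tax_rate = 19.99, 0.0825
    # If the rounding or the formula is wrong, this "expected" value
    # is wrong in exactly the same way, and the test still passes.
    expected = round(amount + amount * tax_rate, 2)
    assert total_with_tax(amount, tax_rate) == expected
```

At a glance this looks like a real test with a computed expectation; only reading both sides of the assertion reveals it can never fail.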

john-tells-all · yesterday at 2:56 PM

Yes. And, a bad test -- that passes because it's defined to pass -- is _much worse_ than no test at all. It makes you think an edge case is "covered" with a meaningful check.

Worse: once you have one "bad apple" in your pile of tests, it decreases trust in the _whole batch of tests_. Each time a test passes, you have to wonder whether it's a bad test...

lbreakjai · yesterday at 4:48 PM

That's where mutation testing becomes even more valuable. If the test still passes after the code has been mutated, then you may want to look deeper, because it's a sign that the test is not good.
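As a toy sketch of that idea (hand-rolled for illustration; real tools such as mutmut or Stryker automate the mutation and reporting), here is a mutant that survives a weak test but is killed by a boundary check:

```python
import operator

# Parameterize the comparison so we can inject a "mutant" operator,
# simulating what a mutation-testing tool does to the source.
def make_is_adult(op):
    def is_adult(age):
        return op(age, 18)
    return is_adult

def weak_test(is_adult):
    # Never checks the boundary age == 18.
    return is_adult(30) is True and is_adult(5) is False

def strong_test(is_adult):
    # Pins down the boundary exactly.
    return is_adult(18) is True and is_adult(17) is False

original = make_is_adult(operator.ge)  # age >= 18
mutant = make_is_adult(operator.gt)    # mutated to age > 18

assert weak_test(original)
assert weak_test(mutant)        # mutant survives: the test is too weak
assert strong_test(original)
assert not strong_test(mutant)  # the stronger test kills the mutant
```

A surviving mutant is exactly the signal described above: the code changed, the tests still passed, so that behavior was never really under test.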

alkonaut · yesterday at 3:06 PM

This seems like it should be very easy to validate. Force the AI to make a minimal change to the code under test that makes a single test (or as few as possible) fail as a result. If no change it makes can cause a test to fail at all, the test suite is useless.

jeremyloy_wt · yesterday at 3:01 PM

> we as humans can guide the LLM toward a rigorous test suite, rather than one that has a lot of "coverage" but doesn't actually provide sound guarantees about behavior.

I have a hard enough time getting humans to write tests like this…