Hacker News

wtallis · yesterday at 11:36 PM · 3 replies

> Tests can’t prevent this because for a test suite to cover all observable behavior, it would need to be more complex than the code. In which case, it wouldn’t be any easier for machine or human to understand.

I don't think "complex" is the right word here. A test suite would generally be more verbose than the implementation, but a lot of the time it can simply be a long list of input->output pairs that are individually very comprehensible and easily reviewable by a human. The hard part is usually discovering what isn't covered by the test suite, rather than validating the correctness of the test cases you do have.
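A minimal sketch of the "long list of input->output pairs" style of test the comment describes, using a hypothetical slugify() as the unit under test (the function and cases are invented for illustration, not taken from the thread):

```python
def slugify(title: str) -> str:
    """Toy unit under test: lowercase a title and join words with hyphens."""
    return "-".join(title.lower().split())

# Each pair is individually trivial to review, even if the list grows long.
CASES = [
    ("Hello World", "hello-world"),
    ("  Leading  spaces ", "leading-spaces"),
    ("already-a-slug", "already-a-slug"),
]

for given, expected in CASES:
    assert slugify(given) == expected, (given, expected)
```

The reviewability comes from the table shape: correctness of each row is obvious in isolation, which is exactly why the hard question becomes which rows are missing.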


Replies

sarchertech · today at 2:07 AM

At some point verbosity becomes complexity. If you're talking about all observable behavior, the input/output pairs are likely to be quite verbose and complex.

Imagine testing a game where the inputs are the possible states of the game plus the possible control inputs, and the outputs are the states that could result.
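A back-of-the-envelope sketch of the explosion the comment is pointing at. All the numbers here are invented for illustration; a real game's state space is vastly larger:

```python
# Toy model: one entity's state is position x facing x health on a tiny grid.
positions = 100          # 10x10 grid
facings = 4              # N/E/S/W
health_levels = 3
entity_states = positions * facings * health_levels   # 1,200

entities = 5             # states multiply across independent entities
control_inputs = 8       # one frame of controller input

game_states = entity_states ** entities               # 1,200^5 ~ 2.5e15
test_cases = game_states * control_inputs             # ~2e16 input->output pairs

assert test_cases > 10**15   # exhaustive pair-listing is hopeless even here
```

Even this deliberately tiny model needs on the order of 10^16 input->output pairs to cover every (state, input) -> state transition, which is the sense in which sheer verbosity turns back into complexity.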

Of course, very few human-written programs require this level of testing, but if you are trying to prevent a swarm of agents from changing observable behavior without human review, that's what you'd need.

Even with simpler input/output pairs, suppose an AI tells you it added a feature and had to change 2,000 input/output pairs to do so. How do you verify that those changes were necessary, and how do you verify that you actually have enough cases to prevent the AI from doing something dumb?

Oops, you didn't have a test that said items shouldn't turn completely transparent when you drag them.

skydhash · today at 2:07 AM

Code is like f(x) = ax + b. Your test would be a list of (x, y) tuples. You don't verify the correctness of your points, because they come from some source that you hold as true. What you want is the generic solution (the theory) proposed by the formula, and your test would be just a small set of points, mostly to ensure that no one has changed the a and b parameters. But given only a finite number of points, the AI is more likely to give you a complicated spline formula than the simple formula above, unless the tokens in the prompt push it to the right domain space (usually meaning the problem is already solved).
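The "complicated spline instead of the simple line" failure mode can be made concrete. This is my own construction, not from the thread: take f(x) = 2x + 1, sample a finite test set, then add a term that vanishes at every sampled point. The overfit version passes the entire suite yet computes something different everywhere else:

```python
XS = [0, 1, 2, 3, 4, 5]          # the finite set of tested inputs

def f(x):
    """The simple 'theory': f(x) = 2x + 1."""
    return 2 * x + 1

def g(x):
    """An overfit alternative: f plus a bump that is zero at every test point."""
    bump = 1
    for xi in XS:
        bump *= (x - xi)         # vanishes whenever x is in XS
    return 2 * x + 1 + bump

# Both functions pass the whole finite test suite...
assert all(f(x) == g(x) for x in XS)
# ...but disagree on any unsampled input.
assert f(2.5) != g(2.5)
```

No finite list of points can rule out such a function; only preferring the simpler theory (the "right domain space") does.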

Real code has more dimensionality than the above example. Experts have the right keywords, but even then it's a roll of the dice. And coming up with enough sample test cases is more arduous than writing the implementation.

Unless there's no exact solution (the dimensionality is too high) but we have a lot of test data of lower dimensionality than the problem. This used to be called machine learning, and we have metrics like accuracy for it.

wizzwizz4 · yesterday at 11:38 PM

If some of those input-output pairs are the result of a different interpretation of the spec from other input-output pairs, it's possible that no program satisfies all the tests (or, worse, that a program that satisfies all the tests isn't correct).