Hacker News

59nadir · yesterday at 5:37 AM · 5 replies

We've seen public examples of LLMs literally disabling or removing tests in order to make them pass. I'm not sure that having tests, and asking LLMs not to merge anything until the tests pass, being "easy" matters much when the failure modes here are so plentiful and broad in nature.


Replies

jawiggins · yesterday at 4:22 PM

You'd want to run the tests as a GitHub Action and fail the check if the tests don't pass. Optio will resume agents when the actions fail and tell them to fix the failures.
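A minimal sketch of what that workflow could look like, assuming a Python project tested with pytest (the filename, trigger, and step details are hypothetical, not anything the commenter specified):

```yaml
# .github/workflows/tests.yml -- hypothetical minimal example
name: tests
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      # pytest's non-zero exit code fails this step, which fails the check;
      # with branch protection requiring this check, the PR can't be merged.
      - run: pytest
```

For the failing check to actually block a merge, the repo's branch protection rules need to list it as a required status check.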

ElFitz · yesterday at 7:00 AM

My favourite so far was Claude "fixing" deployment checks with `continue-on-error: true`

SR2Z · yesterday at 6:58 PM

So... add another presubmit test that fails when a test is removed. Require human reviews.

It's not like a human being always pushes correct code. My risk assessment for an LLM reading a small bug and just making a PR is that thinking too hard about it is a waste of time, and my assessment for a human is very similar, because actually catching issues during code review is best done by tests anyway. If the tests can't tell you whether your code is good, then it really doesn't matter whether it's a human or an LLM: you're mostly just guessing whether things are going to work, and you WILL push bad code that gets caught in prod.
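One way to sketch such a presubmit check (everything here is hypothetical: in practice you'd collect the test IDs on the base and head branches in CI, e.g. with `pytest --collect-only -q`, and compare them):

```python
import sys


def removed_tests(base_tests, head_tests):
    """Return tests that exist on the base branch but are missing from the head branch."""
    return sorted(set(base_tests) - set(head_tests))


def check(base_tests, head_tests):
    """Exit-code style check: 0 if no tests were removed, 1 otherwise."""
    missing = removed_tests(base_tests, head_tests)
    if missing:
        print("Presubmit failed, tests removed:", ", ".join(missing))
        return 1
    return 0


if __name__ == "__main__":
    # Illustrative data; in CI these lists would come from test collection
    # on each branch rather than being hard-coded.
    base = ["test_login", "test_logout", "test_billing"]
    head = ["test_login", "test_billing"]  # test_logout was deleted
    sys.exit(check(base, head))
```

This only catches outright deletions, not tests weakened in place, so it complements rather than replaces human review.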
