The point is that you have to verify it yourself. Like you wrote: "check to see of the query produces what you want".
Otherwise the LLM can just write tests against whatever it wrote rather than against what is expected. This happens often, even with the top models.
Someone needs to check that the tests work, review whether they cover edge cases, etc.
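
A made-up illustration of what I mean, not taken from any real model output: `median`, `check_mirrors_implementation` and `check_against_spec` are hypothetical names, and the bug is invented for the example. The first check copies its expected value from what the code returns, so it passes. The second takes its expected value from the definition of the median, so it catches the bug.

```python
def median(values):
    # Hypothetical LLM-written code with a bug: for an even number of
    # values it returns the upper middle element instead of the average
    # of the two middle elements.
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def check_mirrors_implementation():
    # Expected value was copied from the function's own output,
    # so this passes even though the result is wrong.
    assert median([1, 2, 3, 4]) == 3


def check_against_spec():
    # Expected value worked out by hand from the definition:
    # the median of 1, 2, 3, 4 is (2 + 3) / 2 = 2.5.
    assert median([1, 2, 3, 4]) == 2.5
```

Only a person who knows what the answer should be can tell which of the two checks is the meaningful one.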