The other day I came across a video showing workers in an e-vape factory. They pick up a bunch of e-vapes from the conveyor belt (each bunch has 6 e-vapes, I think), stick them in their mouths, vigorously vape all of them for about 5 seconds, then test the next bunch. Humans reviewing hundreds of changed lines in a PR written by AI is not very different.
Very true. If a PR has 1000 lines, I would check only a handful of them and leave the rest for the test suite.
You can do statistical testing of the e-vape line because you have specific criteria and well-defined tolerances that you can apply on a per-sample basis, and the factory meets them with some acceptable number of nines of reliability.
PRs are not like this because a single bad PR can be catastrophic for your business in a way that a single bad e-vape cannot.
I would also argue that the current output from the AIs, when sampled by software engineers, regularly doesn't meet the quality bar we want in our product, hence the need to review every PR and fix a substantial fraction of them.
If you can start to bound the impact of changes and the outputs begin to be generally acceptable unsupervised, such that all you're doing is double checking that nothing has regressed in the factory, then the sampling approach can work.
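To put a rough number on the sampling idea, here is a back-of-the-envelope sketch. It assumes things the thread doesn't state: the reviewer picks lines uniformly at random, bad lines are scattered independently of where the reviewer looks, and any bad line that gets read is always caught. The function name and parameters are made up for illustration.

```python
from math import comb


def miss_probability(total_lines: int, bad_lines: int, sampled_lines: int) -> float:
    """Chance that a random sample of lines contains none of the bad ones.

    Hypothetical model: sampling without replacement (hypergeometric),
    and a bad line is always spotted if it is in the sample.
    """
    if not 0 <= bad_lines <= total_lines:
        raise ValueError("bad_lines must be between 0 and total_lines")
    if not 0 <= sampled_lines <= total_lines:
        raise ValueError("sampled_lines must be between 0 and total_lines")
    # Ways to draw the sample from good lines only, over all possible samples.
    # comb() returns 0 when the sample is larger than the pool of good lines.
    return comb(total_lines - bad_lines, sampled_lines) / comb(total_lines, sampled_lines)


if __name__ == "__main__":
    # 1000-line PR, one bad line, reviewer reads 50 lines.
    print(miss_probability(1000, 1, 50))
```

With a single bad line the formula reduces to 1 - sampled/total, so reading 50 of 1000 lines misses it 95% of the time. That is fine when a miss costs one bad e-vape and much less fine when a miss can take down production.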