Curious how they do a “blind” preference test. I’m sure it’s quite clear to any evaluator which answer is AI and which is human.