This is not how people use LLMs. If you ask one of these questions you’d get a longer answer, often grounded on the internet. I speculate that conditional on a smart human operator interpreting the results, such interpretations across vendors converge more often than this report makes it seem.
Even then, there can often be substantive disagreements based on context. Hence the need for even a mostly true or mostly false bucket.