You can see up-thread that the same model will produce different answers for different people or even from run to run.
That seems problematic for a very basic question.
Yes, models can be harnessed with structures that run queries 100x and take the "best" answer, and we can claim that if the best answer gets it right, models therefore "can solve" the problem. But for practical end-user AI use, high error rates are a problem and greatly undermine confidence.