Yes, Grok gets it right even when told not to use web search. But the answer I got from the fast model is nonsensical: it recommends driving because you wouldn't save any time walking and because "you'd have to walk back wet". The thinking-fast model gets it correct for the right reasons every time. Chain of thought really helps in this case.
Interestingly, Gemini also gets it right. It seems to be better able to pick up on the fact it's a trick question.
You're probably on the right track about the cause, but it's unlikely to be injected in post-training. I'd expect post-training to help improve the situation. The problem starts with the training set: if you just train an LLM on the internet, you get extreme far-left models. This problem has been talked about by all the major labs. Meta said fixing it was one of their main focuses for Llama 4 in their release announcement, and xAI and OpenAI have made similar comments. The xAI team has probably just done a lot more to clean the data set.
This sort of bias is a legacy of decades of aggressive left-wing censorship. Written texts about the environment are dominated by academic output (where any conservative voices are purged), legacy media (same) and web forums (same), so the models learn far-left views by reading these outputs. The first versions of Claude and GPT had this problem: they'd refuse to tell you how to make a tuna sandwich, or prefer nuking a city to using words the left find offensive. The bias is then partly corrected in post-training and by trying to filter the dataset to be more representative of reality.
Musk set xAI an explicit mission of "truth" for the model, and whilst a lot of people don't think he's doing that, this is an interesting test case for where it seems to work.
Gemini's training is probably less focused on cleaning up the dataset, but it has stronger logical reasoning capabilities in general than other models, and that can override ideological bias.
Thanks, I did not know about that pre-training bias. This does make sense.
Can you draw the connection more explicitly between political biases in LLMs (or training data) and common-sense reasoning task failures? I understand that there are lots of bias issues there, but it's not intuitive to me how this would lead to a greater likelihood of failure on this kind of task.
Conversely, did labs that tried to counter some biases (or change their directions) end up with better scores on metrics for other model abilities?
A striking thing about human society is that even when we interact with others who have very different worldviews from our own, we usually manage to communicate effectively about everyday practical tasks and our immediate physical environment. We do have the inferential distance problem when we start talking about certain concepts that aren't culturally shared, but usually we can talk effectively about who and what is where, what we want to do right now, whether it's possible, etc.
Are you suggesting that a lot of LLMs are falling down on the corresponding immediate-and-concrete communicative and practical reasoning tasks specifically because of their political biases?