Speaking as someone who thinks the Chinese Room argument is an obvious case of begging the question, GP isn't about that. They're not saying that LLMs don't have world models - they're saying that those world models aren't grounded in the physical world, and so the models can't properly understand what they're talking about.
I don't think that's true anymore, though. All the SOTA models are multimodal now, meaning they're trained on images and video as well, not just text; and labs do that precisely because it improves the text output too, for this exact reason. Already, I don't have to waste time explaining to Claude or Codex what I want on a webpage - I can just sketch a mock-up, or, when there's a bug, take a screenshot and circle the bits that are wrong. But this extends into the ability to reason about the real world, as well.
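For what it's worth, that screenshot workflow is just an ordinary multimodal API call. Here's a minimal sketch, assuming the Anthropic Python SDK; the model name and file path are placeholders I picked for illustration, not anything from the original comment:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Load the annotated screenshot (placeholder path)
with open("bug_screenshot.png", "rb") as f:
    screenshot_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # Image block: the screenshot with the broken bits circled
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                },
            },
            # Text block: the actual question about the circled region
            {
                "type": "text",
                "text": "The circled element overlaps the nav bar - suggest a CSS fix.",
            },
        ],
    }],
)

print(message.content[0].text)
```

The point being: the image and the text land in the same context window, so the model's answer about the CSS is conditioned on what it actually sees, not on a prose description of it.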