I'm not a doctor, but I'm amazed that we've apparently reached the point where we need these kinds of complex edge cases to hit the limit of the AI's capability; and this is with o1, released over a year ago, essentially three generations behind the current state of the art.
Sorry for gushing, but I'm amazed that the AI got this far just from "book learning", without ever stepping into a hospital, or even watching an episode of a medical drama, let alone feeling what an actual arm is like.
If we have actually reached the limit of book learning (which is not clear to me), I suppose the next phase would be to have AIs practice against a medical simulator, where the models could see the actual (simulated) result of their intervention rather than a "correct"/"incorrect" response. Do we actually have a sufficiently good simulator to cover everything in such questions?
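To be concrete about what I mean by "practice against a simulator", here's the kind of loop I have in mind (a rough sketch; the env interface, propose_intervention, and update_policy are all hypothetical names, just borrowing the usual RL-environment shape):

    # Sketch of a training loop against a hypothetical medical simulator.
    # The env interface, propose_intervention, and update_policy are
    # made-up names; only the shape of the loop matters here.
    def train_against_simulator(model, env, episodes=1000):
        for _ in range(episodes):
            state = env.reset()          # e.g. a simulated patient presentation
            done = False
            while not done:
                action = model.propose_intervention(state)
                # The simulator returns the downstream (simulated) consequences
                # of the intervention, not a bare correct/incorrect label.
                state, outcome, done = env.step(action)
                model.update_policy(state, action, outcome)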
I agree that the need to design complex edge cases to find AI reasoning weaknesses indicates how far their capabilities have come. However, from a different point of view, failures on these kinds of edge cases, which can be solved via “common sense”, also indicate how far AI has yet to go. These edge cases (e.g. the blood pressure or car wash scenarios), despite being somewhat contrived, are still “common sense” in that an average human (or a med student, in the blood pressure scenario) can reason through them with little effort. AI struggling on these tasks indicates weaknesses in its reasoning, e.g. limited generalization ability.
The simulator or world-model approach is being investigated. To your point, textual questions alone do not provide adequate coverage to assess real-world reasoning.
These failure modes are not AI’s edge cases at the limit of its capabilities. Rather, they demonstrate a certain category of issues with generalization (and “common sense”), as evidenced by the models’ failure upon slight, irrelevant changes in the input. In fact, this is nothing new; it has been one of LLMs’ fundamental characteristics since their inception.
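To make “slight, irrelevant changes” concrete, this is the kind of probe I mean (a toy sketch; query_model stands in for whatever inference call you’d actually use):

    # Toy robustness probe: does the model's answer survive irrelevant edits?
    # query_model is a placeholder for a real inference call.
    def perturbation_check(query_model, question, edits):
        baseline = query_model(question)
        return [e for e in edits if query_model(e(question)) != baseline]

    # Example edits that should not change the answer: an irrelevant
    # detail appended, or a term swapped for its common abbreviation.
    edits = [
        lambda q: q + " The patient arrived by car.",
        lambda q: q.replace("blood pressure", "BP"),
    ]

A non-empty result means the answer flipped on a change that should not matter, i.e. exactly the generalization failure in question.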
As for your suggestion about learning from simulations: it does sound interesting for expanding both pre- and post-training, but it still wouldn’t address this problem, only hide the shortcomings better.