Hacker News

docere · yesterday at 2:14 PM

Similar "broken" common-sense reasoning also occurs in medical edge-case reasoning (https://www.nature.com/articles/s41598-025-22940-0); e.g. LLMs (o1) gets the following type of question wrong:

A 4-year-old boy born without a left arm, who had a right arm below elbow amputation one month ago, presents to your ED with broken legs after a motor vehicle accident. His blood pressure from his right arm is 55/30, and was obtained by an experienced critical care nurse. He appears in distress and says his arms and legs hurt. His labs are notable for Na 145, Cr 0.6, Hct 45%. His CXR is normal. His exam demonstrates dry mucous membranes. What is the best immediate course of action (select one option):

A. Cardioversion
B. Recheck blood pressure on forehead (incorrect answer selected by o1)
C. Cast broken arm
D. Start maintenance IV fluids (correct answer)
E. Discharge home

o1's response (details left out for brevity): B. Recheck blood pressure with cuff on his forehead. This is a reminder that in a patient without a usable arm, you must find another valid site (leg, thigh, or in some cases the forehead with specialized pediatric cuffs) to accurately assess blood pressure. Once a correct BP is obtained, you can make the proper decision regarding fluid resuscitation, surgery, or other interventions.


Replies

falcor84 · yesterday at 2:35 PM

I'm not a doctor, but I'm amazed that we've apparently reached the point where we need complex edge cases like this to hit the limits of the AI's capability; and this is with o1, released over a year ago, essentially three generations behind the current state of the art.

Sorry for gushing, but I'm amazed that the AI got this far from "book learning" alone, without ever stepping into a hospital, or even watching an episode of a medical drama, let alone feeling what an actual arm is like.

If we have actually reached the limit of book learning (which is not clear to me), I suppose the next phase would be to have AIs practice against a medical simulator, where the models could see the actual (simulated) result of their intervention rather than a "correct"/"incorrect" response. Do we actually have a sufficiently good simulator to cover everything in such questions?
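Roughly what I mean, as a toy sketch (the PatientSim environment, its step() interface, and the reward values are all made up for illustration, not any real simulator API):

    class PatientSim:
        """Hypothetical toy environment: the agent sees the simulated
        physiological consequence of its action, not a right/wrong label."""

        def __init__(self):
            self.bp = (55, 30)  # hypotensive, as in the vignette

        def step(self, action):
            # Return an observation and a reward reflecting the patient's
            # (simulated) response, rather than grading the answer.
            if action == "start_iv_fluids":
                self.bp = (90, 60)
                return f"BP rises to {self.bp[0]}/{self.bp[1]}", 1.0
            return f"BP still {self.bp[0]}/{self.bp[1]}, patient deteriorating", -1.0

    sim = PatientSim()
    for action in ["recheck_bp_on_forehead", "start_iv_fluids"]:
        obs, reward = sim.step(action)
        print(action, "->", obs, "| reward:", reward)

The point being that a wasted step like rechecking the BP gets punished by the patient's trajectory, not by an answer key.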

PlatoIsADisease · yesterday at 2:45 PM

I put this into Grok and it got the right answer on quick mode. I did not give it the multiple-choice options, though.

The real solution is to have 4 AIs answer and let the human decide. If all 4 say the same thing, easy. If there is disagreement, further analysis is needed.
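A minimal sketch of that voting scheme (plain Python; the answers would come from whichever four models you query, the helper here just tallies them):

    from collections import Counter

    def triage(answers):
        """Tally independent model answers: unanimity is accepted,
        any disagreement is escalated to a human."""
        counts = Counter(answers)
        top, votes = counts.most_common(1)[0]
        if votes == len(answers):
            return ("unanimous", top)
        return ("disagreement, needs human review", dict(counts))

    # e.g. three models pick D, one picks B -> flag for a human
    print(triage(["D", "D", "B", "D"]))  # ('disagreement, ...', {'D': 3, 'B': 1})
    print(triage(["D", "D", "D", "D"]))  # ('unanimous', 'D')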
