They most definitely threw in RL with formal verification somewhere between GPT-4 and now. The models are better at not hallucinating. I don't think their IMO team are only show ponies...