visarga, yesterday at 3:00 PM

No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And they probably use Claude data too, with a human in the loop and tool feedback.