Hacker News

eru, today at 7:11 AM

This might be like an observational study vs a study with a control?


Replies

anon373839, today at 7:30 AM

From what I understand, at this point the main value of stronger model outputs is simply to bootstrap reasoning behavior during the RL post-training phase. It gets you past the “cold start” problem with RL, after which the outputs aren’t needed anymore. From then on, it’s hill climbing, and that requires environments for the model to interact with and get rewards from.
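
To make the cold-start point concrete, here is a toy sketch, not anything resembling a real LLM pipeline. It assumes a 1000-armed bandit with a single rewarded action as a stand-in for a sparse-reward environment, and a hypothetical "teacher" that always demonstrates the rewarded action as a stand-in for stronger model outputs. All names (`sft_step`, `reinforce_step`, `train`) and all hyperparameters are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_step(logits, teacher_action, lr):
    """Supervised step: cross-entropy gradient toward the teacher's action."""
    probs = softmax(logits)
    for i in range(len(logits)):
        target = 1.0 if i == teacher_action else 0.0
        logits[i] += lr * (target - probs[i])


def reinforce_step(logits, reward_fn, lr, rng):
    """One REINFORCE update (no baseline) from a single sampled action."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    for i in range(len(logits)):
        indicator = 1.0 if i == action else 0.0
        logits[i] += lr * reward * (indicator - probs[i])
    return reward


def train(n_actions=1000, good_action=7, sft_steps=0, rl_steps=200, seed=0):
    """Optionally bootstrap from teacher demonstrations, then hill-climb with RL.

    Returns the final probability the policy assigns to the rewarded action.
    """
    rng = random.Random(seed)
    logits = [0.0] * n_actions  # uniform policy at the start

    def reward_fn(action):
        # Sparse reward: only one action in the whole space pays off.
        return 1.0 if action == good_action else 0.0

    # Phase 1 (assumed): imitate the teacher to get past the cold start.
    for _ in range(sft_steps):
        sft_step(logits, good_action, lr=1.0)

    # Phase 2: no teacher outputs anymore, only environment rewards.
    for _ in range(rl_steps):
        reinforce_step(logits, reward_fn, lr=1.0, rng=rng)

    return softmax(logits)[good_action]


if __name__ == "__main__":
    print("RL only:       ", train(sft_steps=0))
    print("bootstrap + RL:", train(sft_steps=20))
```

In this toy setup the RL-only policy almost never samples the rewarded action, so it gets almost no gradient signal, while the bootstrapped policy starts RL already sampling it often enough to keep climbing from rewards alone.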