This might be like an observational study vs a study with a control?

eru • today at 7:11 AM • 1 reply

Replies

From what I understand, at this point, the main value of stronger model outputs is simply to bootstrap reasoning behavior during the RL post-training phase. It gets you past the “cold start” problem with RL, after which the outputs aren’t needed anymore. From then on, it’s hill climbing, and that requires environments for the model to interact with and get rewards from.
