Hacker News

hodgehog11 · last Saturday at 10:58 PM

This argument, that LLMs can develop novel strategies using RLVR on math problems (as happened with chess), turns out to be false without a serious paradigm shift. Essentially, the search space is far too large, and the model will need help to explore it effectively, probably via human feedback.

https://arxiv.org/abs/2504.13837


Replies

ineedasername · yesterday at 6:04 PM

That linked article says it's about RLVR but then goes on to conflate other forms of RL with it. It also doesn't address much of the core thinking in the paper it was partially responding to, published a month earlier[0], which laid out its findings and theory reasonably well, including results that run counter to the main criticism in the article you cited, i.e., that performance at or above the base model is only observed at low k (few samples per problem).
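For readers unfamiliar with the "low k" point: these papers compare models with the pass@k metric, the probability that at least one of k sampled attempts solves a problem. A minimal sketch of the standard unbiased estimator (from Chen et al.'s HumanEval paper; the numbers below are made up for illustration, not taken from either cited paper):

```python
# Unbiased pass@k estimator: given n samples per problem, c of which
# are correct, estimate P(at least one of k random samples is correct).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 correct out of 100 samples, pass@1 is low but pass@50 is high.
# The cited paper's criticism is that RL-tuned models win at small k,
# while base models catch up (or overtake) as k grows.
print(pass_at_k(100, 10, 1))   # ≈ 0.10
print(pass_at_k(100, 10, 50))  # close to 1.0
```

The crossover at large k is the crux: if the base model eventually reaches the same solutions given enough samples, RLVR is reshaping the sampling distribution rather than extending what is reachable.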

That said, reachability and novel strategies are somewhat overlapping concerns, and I don't see many ways in which RL as mainly practiced improves a model's reachability. And even when it isn't clipping weights, it's too much of a black-box approach.

But none of this takes away from the question of raw model capability at novel strategies; it only speaks to what RL contributes.

[0] https://arxiv.org/pdf/2506.14245

throwaway27448 · yesterday at 8:20 AM

I agree that LLMs are a bad fit for mathematical reasoning, but it's very hard for me to buy that humans are a better fit than a computational approach. Search will always beat our intuition.

narrator · last Saturday at 11:44 PM

The search space for the game of Go was also thought to be too large for computers to manage.
