Hacker News

orbital-decay | yesterday at 10:21 AM

I have nothing to contribute but speculation based on my intuition, but IMO RLHF (or rather human preference modeling in general, including the post-training dataset formatting) is a relatively small factor in this; RL-induced mode collapse is a much bigger one. Take a look at the original DeepSeek R1 Zero, the point of which was to train a model with very little human preference, because they were on a budget and human preference doesn't scale. It's pretty unhinged in its writing, like the base model, but unlike the base model it converges onto stable writing patterns, and its output diversity is as non-existent as in models with carefully engineered "personalities" like Claude. Ask it to name a random city and look at the logits, and you'll still see a pretty narrow distribution. At the same time, some models trained with RLHF (e.g. the old RedPajama) have more diverse outputs.
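To make the "narrow distribution" point concrete, here's a minimal sketch of what looking at the logits would show. The logit values below are made up for illustration (a real check would read them off an actual model after a "name a random city" prompt); the point is that entropy of the softmaxed next-token distribution quantifies how collapsed the model is.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = np.exp(logits - logits.max())
    return z / z.sum()

def entropy(p):
    # Shannon entropy in nats; clip to avoid log(0).
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

# Hypothetical next-token logits after "Name a random city:".
# A collapsed model puts almost all mass on one token; a base
# model spreads it across many plausible cities.
collapsed = np.array([9.0, 2.0, 1.5, 1.0, 0.5])  # nearly always the same city
diverse   = np.array([2.0, 1.9, 1.8, 1.7, 1.6])  # many cities roughly tied

print(entropy(softmax(collapsed)))  # low: distribution is narrow
print(entropy(softmax(diverse)))    # high: close to log(5) ≈ 1.61 nats
```

Running the same entropy check on a real model's logits (e.g. via a Hugging Face causal LM's final-position logits) is a one-liner on top of this.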

Mode collapse makes the models truncate entire token trajectories and repeat themselves, and indirectly it does something MUCH deeper: they converge on an almost 1:1 input-to-output concept mapping (instead of one-to-many, as in base models). The same lack of variety can be seen in diffusion models, GANs, VAEs, and any other model architecture, whether or not it receives human preference feedback.

Moreover, these patterns are generational: old ones get replaced with new ones, and the list in the OP is going to be obsolete in a year. From what I can tell, this has already happened several times with previous models, supposedly because new models are trained on web scrapes polluted by previous-generation outputs.


Replies

lelanthran | yesterday at 12:43 PM

Doesn't this apply to all output from a model, not just English?

IOW, won't code generated by the model have the same deficiencies with respect to lack of diversity?
