I think people are misunderstanding the relationship between reward functions and LLMs.
LLMs don't actually have a built-in reward system the way some other ML setups (e.g. an RL agent with an explicit reward signal) do.
They're trained with one, and in the case of DPO you can even say the trained policy encodes an implicit one.
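For context on that last point: the DPO paper derives that the trained policy corresponds to a reward of the form r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)), up to a term that cancels when comparing two responses to the same prompt. A minimal sketch of computing it, assuming you've already summed the response-token log-probs under the policy and a frozen reference model (the function name and β value here are just illustrative):

```python
import torch

def dpo_implicit_reward(policy_logprob: torch.Tensor,
                        ref_logprob: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Implicit reward from DPO: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Inputs are the summed log-probabilities of the response tokens under the
    trained policy and the frozen reference model. The additive partition-
    function term is omitted since it cancels in pairwise comparisons.
    """
    return beta * (policy_logprob - ref_logprob)
```

So the reward isn't a separate component sitting inside the model; it's something you can recover by comparing the policy's log-probs against the reference model's.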