Producing the most plausible code is literally encoded into the cross entropy loss function and is f...

jbergqvist • yesterday at 10:12 PM • 0 replies • view on HN

Producing the most plausible code is literally encoded into the cross entropy loss function and is fundamental to the pre-training. I suppose post training methods like RLVR are supposed to correct for this by optimizing correctness instead of plausibility, but there are probably many artifacts like these still lurking in the model's reasoning and outputs. To me it seems at least possible that the AI labs will find ways to improve the reward engineering to encourage better solutions in the coming years though.

alt Hacker News