Hacker News

ACCount37 · yesterday at 12:27 PM

No. There's no "answer" really.

They use self-distillation to shift the output distribution of the model towards that of the same model run with different temperature/truncation sampling settings.

This effectively "folds" the logit tail truncation behavior into the model itself.
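A toy sketch of what that distillation objective could look like, assuming nucleus (top-p) truncation as the tail-truncation scheme; the function names, temperature, and top-p values here are illustrative assumptions, not details from the comment:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_truncate(probs, top_p=0.9):
    """Zero out the logit tail: keep the smallest set of highest-probability
    tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    z = sum(out)
    return [p / z for p in out]

def self_distill_loss(student_logits, teacher_logits,
                      temperature=0.8, top_p=0.9):
    """Cross-entropy of the student's plain softmax against the teacher's
    temperature-scaled, tail-truncated distribution. Minimizing this over
    training data pushes the truncation behavior into the model itself."""
    target = nucleus_truncate(softmax(teacher_logits, temperature), top_p)
    log_student = [math.log(p) for p in softmax(student_logits)]
    return -sum(t * lp for t, lp in zip(target, log_student))
```

At convergence the student's untruncated softmax already concentrates its mass the way the truncated sampler would, so the sampler-side truncation becomes unnecessary.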

In effect it's not entirely unlike some of the "model-controlled sampling settings" approaches I've seen, but different in execution.