you're probably overcomplicating it. As the paper says, it's embarrassingly simple: given a problem set, generate one response per problem with a fixed temperature and truncation setting, then fine-tune the model on its own generations.
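To make the shape of that loop concrete, here's a minimal sketch. Everything here is a hypothetical stand-in: `sample_completion` plays the role of the real model's sampler (with its fixed temperature and truncation), and the actual fine-tuning step is left as a comment since it would just be standard SFT on the collected pairs.

```python
import random

random.seed(0)

# Toy stand-in for the model's sampler at a fixed temperature + truncation.
# In practice this would be a real LLM generation call; here it just emits
# random tokens so the data-collection loop is runnable.
def sample_completion(problem, temperature=0.7, max_tokens=8):
    vocab = ["return", "x", "+", "1", ";"]
    return " ".join(random.choice(vocab) for _ in range(max_tokens))

problems = ["add one", "identity", "negate"]

# Step 1: one generation per problem, same sampling settings throughout.
distill_set = [
    {"prompt": p, "completion": sample_completion(p)} for p in problems
]

# Step 2 (not shown): fine-tune the same model on `distill_set` with a
# standard SFT loss over the completion tokens.
print(len(distill_set))  # one example per problem
```

The point is just that there's no reward model, no filtering by correctness, no labels: the "dataset" is whatever the model sampled under those fixed settings.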
Their hypothesis for why this works takes a bit more background, but the gist is: when a model generates code, some positions have exactly one right answer (say, a closing bracket) while others have many valid options (say, a variable name), yet the model's output distribution gets sharpened or flattened by one global temperature at both kinds of position. Sampling at a fixed temperature with truncation (which filters out garbage low-probability tokens), then training on those outputs, teaches the model to internalize "be precise where there's one answer, stay open-minded where there are several" without anyone labeling which positions are which.
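You can see the sharp-vs-flat effect directly in a toy example. This sketch uses top-p ("nucleus") truncation as the truncation rule, which is an assumption on my part; the paper's exact filter may differ. On a "one right answer" position the truncation collapses to a single token, while on a "many valid options" position several tokens keep probability mass:

```python
import numpy as np

def sample_probs(logits, temperature=0.7, top_p=0.9):
    # Temperature scaling, then softmax.
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    # Top-p truncation: keep the smallest set of highest-probability
    # tokens whose cumulative mass reaches top_p, zero out the rest.
    order = np.argsort(p)[::-1]
    csum = np.cumsum(p[order])
    cutoff = np.searchsorted(csum, top_p) + 1
    q = np.zeros_like(p)
    q[order[:cutoff]] = p[order[:cutoff]]
    return q / q.sum()

# A "sharp" position (one clearly right token) vs a "flat" position
# (several equally plausible tokens).
sharp = np.array([8.0, 1.0, 1.0, 1.0])
flat = np.array([2.0, 2.0, 2.0, 0.1])

q_sharp = sample_probs(sharp)
q_flat = sample_probs(flat)
print(q_sharp)  # only the single right answer survives truncation
print(q_flat)   # several valid options keep probability mass
```

Training on samples drawn from these truncated distributions then pushes the model toward exactly that behavior: near-deterministic at sharp positions, still diverse at flat ones.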
Note that there's a lot more nuance to this and I simplified a lot.