Hacker News

andy_xor_andrew · yesterday at 10:04 PM

Not really? If you read it, there is no validation, no correctness signal, no verification, none of that. They just pass in benchmark inputs, collect the outputs (regardless of their quality), train on those outputs, and then sweep the decode settings (temperature, top-k) of the resulting model. Their conclusion is that this yields a better model than the original, even when the original gets the same temperature/top-k sweep.
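To make the last step concrete, the decode-setting sweep described above is just a grid search over sampling parameters against a benchmark score. A minimal sketch (function names and the toy scoring function are mine, not from the paper):

```python
# Hypothetical sketch of the decode-settings sweep: try every
# (temperature, top_k) pair and keep whichever scores best on the benchmark.
from itertools import product

def sweep_decode_settings(evaluate, temps, top_ks):
    """Return the (temperature, top_k) pair that maximizes benchmark score.

    `evaluate(temp, top_k)` is assumed to run the benchmark with those
    decode settings and return a scalar score (higher is better).
    """
    return max(product(temps, top_ks), key=lambda pair: evaluate(*pair))

# Toy stand-in for a benchmark run: pretend temp=0.7, top_k=50 is optimal.
def toy_eval(temp, top_k):
    return -abs(temp - 0.7) - abs(top_k - 50) / 100

print(sweep_decode_settings(toy_eval, [0.0, 0.7, 1.0], [20, 50, 100]))
# → (0.7, 50)
```

The point of running the same sweep on the original model is to make the comparison fair: the fine-tuned model's gain can't be explained away as just a better decode setting.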

So no, they are not fine-tuning a general purpose model to produce "valid benchmark code results."


Replies

fpgaminer · yesterday at 11:42 PM

Not only that, they additionally ran an experiment with the training temperature turned way up (2.0) and truncation turned off such that the majority of SFT examples were incoherent (63% IIRC). Yet the model finetuned on these broken examples still improved over baseline.
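For context on why temperature 2.0 wrecks coherence: sampling divides the logits by the temperature before the softmax, which flattens the next-token distribution and sends probability mass to low-ranked tokens. A toy illustration (my own numbers, not the paper's):

```python
# Toy demo: higher sampling temperature flattens the next-token
# distribution, so unlikely tokens get picked far more often.
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.0]                 # toy next-token logits
p1 = softmax_with_temperature(logits, 1.0)
p2 = softmax_with_temperature(logits, 2.0)
print(round(p1[0], 3), round(p2[0], 3))  # top token loses mass: 0.867 → 0.665
```

At temperature 2.0 the top token's share drops substantially, so over a long generation the chance of staying on a coherent path collapses, which matches the reported ~63% incoherent SFT examples.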

hooloovoo_zoo · yesterday at 10:18 PM

They are training the model to 1. produce code (as opposed to answering a question, writing a poem, etc.) and 2. produce output long enough to be a valid solution. So they are doing exactly what I said. Cheers.
