Hacker News

hooloovoo_zoo · yesterday at 9:53 PM

One sentence summary: We fine-tuned a general-purpose model to produce valid benchmark code results and it got better at producing benchmark code results; we didn't bother to evaluate it on anything the model used to be good at.


Replies

andy_xor_andrew · yesterday at 10:04 PM

Not really? If you read it, there is no validation, no correctness signal, no verification, none of that. They're just passing in benchmark inputs, collecting the outputs (regardless of their quality), training on those outputs, and then sweeping the decode settings (temperature, top-k) of the resulting model. Their conclusion is that this produces a better model than the original, even when the original gets the same temperature/top-k sweep.

So no, they are not fine-tuning a general purpose model to produce "valid benchmark code results."
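The procedure described above, stripped to a toy sketch (all names and functions here are hypothetical stand-ins, not the paper's actual code): collect the model's raw outputs on benchmark inputs with no correctness filtering, fine-tune on them, then sweep decode settings for both models so the comparison is fair.

```python
import itertools

def generate(model, prompt, temperature=1.0, top_k=50):
    # Stand-in for real sampling; returns a fake completion string.
    return f"{model}:{prompt}:t={temperature}:k={top_k}"

def fine_tune(model, dataset):
    # Stand-in for a training run; returns a tag for the tuned model.
    return model + "+sft"

def benchmark_score(model, temperature, top_k):
    # Stand-in scorer; a real run would execute the benchmark tasks.
    return hash((model, temperature, top_k)) % 100

benchmark_inputs = ["task_1", "task_2", "task_3"]

# Step 1: collect outputs with NO validation or correctness signal.
dataset = [(p, generate("base", p)) for p in benchmark_inputs]

# Step 2: train on those raw, unfiltered outputs.
tuned = fine_tune("base", dataset)

# Step 3: sweep decode settings for BOTH models, so the original
# gets the same temperature/top-k search as the tuned one.
grid = list(itertools.product([0.2, 0.7, 1.0], [20, 50, 100]))
best = {m: max(grid, key=lambda tk: benchmark_score(m, *tk))
        for m in ("base", tuned)}
print(best)
```

The point of the sketch is step 1: nothing ever checks whether an output is correct before it becomes training data, which is what distinguishes this from the "fine-tuned on valid results" reading.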
