For the evaluation question: for small code models, compilation rate on generated functions is the simplest metric that actually correlates with usefulness. Perplexity tells you the model learned the distribution; compilation rate tells you it learned the structure. Beyond that, exact match on function-body completion given a signature is more informative than open-ended generation benchmarks.
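
To make the compile-rate idea concrete, here's a minimal sketch assuming the generations are standalone Python functions as source strings; for a compiled target language you'd shell out to the real compiler instead:

```python
def compiles(source: str) -> bool:
    """Return True if the snippet parses/compiles as Python."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False

def compile_rate(samples: list[str]) -> float:
    """Fraction of generated snippets that compile."""
    if not samples:
        return 0.0
    return sum(compiles(s) for s in samples) / len(samples)

# Example: one well-formed and one malformed generation.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b)\n    return a + b\n",  # missing colon
]
print(compile_rate(samples))  # 0.5
```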
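
And for the exact-match metric, a sketch of the scoring loop; `generate` here is a hypothetical stand-in for whatever model call you're using, and the whitespace normalization is just one reasonable choice, not part of any standard benchmark:

```python
def normalize(body: str) -> str:
    """Strip trailing whitespace per line so formatting noise
    doesn't count as a miss."""
    return "\n".join(line.rstrip() for line in body.strip().splitlines())

def exact_match_rate(cases, generate) -> float:
    """cases: iterable of (signature_prompt, reference_body) pairs."""
    hits, total = 0, 0
    for prompt, reference in cases:
        completion = generate(prompt)  # hypothetical model call
        hits += normalize(completion) == normalize(reference)
        total += 1
    return hits / total if total else 0.0
```

The appeal of exact match here is that it's cheap and unambiguous; the cost is that it penalizes semantically equivalent bodies that differ in form, which is why some normalization is worth doing.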