Some of the coding-specific fine-tunes showed really impressive boosts. Qwen2.5-3B-Instruct is also available [0] -- if it's not too much to ask, I'd be curious how more general models like that stack up in your benchmark.
[0] - https://huggingface.co/Qwen/Qwen2.5-3B-Instruct