I have just been humbled by the Gemma 4 26B QAT build (unsloth's version), which insisted repeatedly that my requirements for some niche WordPress code were wrong and could not be satisfied.
I am a good WP developer, so I kept prodding it, and it kept insisting and explaining clearly why. Turns out it was right and I was wrong, as I would have found out if I'd written the code myself.
I've been using this particular test for days, experimenting with ways to prompt for and generate code. The 4-bit quantisation of the pre-QAT model does not catch this error. Nor does the Qwen 3.6 sparse model, which confidently blazed past it and never mentioned it.
(FWIW neither did plain ChatGPT; maybe Codex would)
Anecdotal, but there you go. I am somewhat weirded out by it.