They do improve, but the general creativity and sparkle we see with increasing scale come mostly from scaling up pretraining/parameter count, so progress there is slow and expensive compared to the speed (and falling cost) people have come to take for granted in math/coding with small cheap models. Hence the reaction to GPT-4.5: exactly as much better in taste and discernment as it should have been based on scaling laws, yet regarded almost universally as a colossal failure. It was about as unpopular as the original GPT-3 was when the paper was released, because people look at the logarithmic-looking gains from scaling up 10x or 100x and are disappointed: "Is that all?! What has the Bitter Lesson or scaling done for me lately?"
So, you can expect coding skills to continue to outpace native LLM taste.
I think we're basically agreeing here. Your point (if I'm reading it right) is that taste and discernment do scale, but the gains come through pretraining/parameter scaling, which is slow and expensive compared to the fast, cheap wins in math/coding from smaller models. So taste is more of a lagging indicator of scale: it improves, but it's the last thing people notice because the benchmarkable stuff races ahead. Which also means taste isn't really a moat, just late to get commoditized.