My guess is that post-training has gotten a lot better in the last couple of years, and what people attribute to better base models is actually just traditional (non-LLM) models layered on top of the LLM, which makes it appear that the model itself has improved in quality (including seemingly fewer hallucinations).
If this is the case, it should be observable through different prompting strategies: the effect would show up when you find a prompt that puts more weight on those post-training models.