While I partially agree with you, there IS work being done to make the metrics comparable. E.g.:
https://ghzhang233.github.io/blog/2026/03/05/train-before-te...
It just hasn't been widely adopted yet. And it might be in each of their interests for it to stay that way for a while. It's basically like p-hacking.