While I partially agree with you, there IS work being done to make the metrics comparable. E.g.:

taegee • today at 8:38 AM

https://ghzhang233.github.io/blog/2026/03/05/train-before-te...

It just hasn't been widely adopted yet, and it may be in each of their individual interests for it to stay that way for a while. It's basically like p-hacking.
