Someone needs to make an actually good benchmark for LLMs that matches real-world expectations; there's more to benchmarks than accuracy against a dataset.
We don't need real-world benchmarks; if they were good for real-world tasks, people would already use them. We need scientific benchmarks that tease out the nature of intelligence, and there are plenty of unsaturated ones. Solving chess using "mostly" language modeling is still an open problem. Beyond that: creating a machine that can explain why a given move is likely optimal at some depth, or an AI that can predict the output of another AI.
this reminds me of that joke where someone says "it's crazy that we have ten different standards for doing this", and then there are 11 standards