Goodhart's law: When a measure becomes a target, it ceases to be a good measure.
AI companies have high incentive to make score go up. They may employ human to write similar-to-benchmark training data to hack benchmark (while not directly train on test).
Throwing your hard problem at work to LLM is a better metric than benchmarks.
I own a business and am constantly using working on using AI in every part of it, both for actual time savings and also as my very practical eval. On the "can this successfully be used to do work that I do or pay someone else to do more quickly/cheaply/etc." eval, I can confirm that models are progressing nicely!