Benchmarks have long been known to be a poor signal of quality for LLMs, but they're the "best" standardized method we have so far. A few others exist, like the food truck and the SVG tests. At the end of the day, there is only one way: having your own benchmark for your own application.
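
For what it's worth, "your own benchmark" doesn't have to be fancy. Here is a rough sketch of what I mean, assuming your model can be wrapped as a plain function from prompt to text. Everything here (`CASES`, `run_benchmark`, `dummy_model`) is a made-up name for illustration, and the two cases are placeholders for prompts taken from your real application:

```python
from typing import Callable, List, Tuple

# Hypothetical cases: (prompt, checker) pairs. Replace these with prompts
# and pass/fail checks from your own application.
Case = Tuple[str, Callable[[str], bool]]

CASES: List[Case] = [
    ("What is 2 + 2? Answer with a number only.", lambda out: out.strip() == "4"),
    ("Reply with the word OK.", lambda out: "ok" in out.lower()),
]


def run_benchmark(model: Callable[[str], str], cases: List[Case] = CASES) -> float:
    """Return the fraction of cases whose output passes its checker."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def dummy_model(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption); it always answers "4".
    return "4"


score = run_benchmark(dummy_model)
print(f"pass rate: {score:.0%}")
```

Swap `dummy_model` for a wrapper around whatever model you actually use, and rerun the same cases each time you change models or prompts.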