Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard.
And personal too. Different engineers are using them for different use cases.