Except for coding, where it’s essentially middle of the pack, and coding is the only area where you can build objective benchmarks. The fact that people on LM Arena prefer its output has no relationship to how intelligent the model actually is.