I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the strong tendency to reward hacking, often write nonsensical test report while the tests actually failed. And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.
Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]
MiniMax 2.1 didn't really work for my data-parsing tasks, a lot of errors.
Instead, this one works surprisingly well for the cost: https://openrouter.ai/xiaomi/mimo-v2-flash
That's what I found with some of these LLM models as well. For example I still like to test those models with algorithm problems, and sometimes when they can't actually solve the problem, they will start to hardcode the test cases into the algorithm itself.. Even DeepSeek was doing this at some point, and some of the most recent ones still do this.