> I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect.
Define "much worse".
+--------------------------------------+-------------+-----------+------------------+
| Benchmark | Claude Opus | DeepSeek | DeepSeek vs Opus |
+--------------------------------------+-------------+-----------+------------------+
| SWE-Bench Verified (coding) | 80.9% | 73.1% | ~90% |
| MMLU (knowledge) | ~91 | ~88.5 | ~97% |
| GPQA (hard science reasoning) | ~79–80 | ~75–76 | ~95% |
| MATH-500 (math reasoning) | ~78 | ~90 | ~115% |
+--------------------------------------+-------------+-----------+------------------+

Where are you getting those benchmark figures from? MATH-500 should be closer to 98% for both models: https://artificialanalysis.ai/evaluations/math-500?models=de...
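The "DeepSeek vs Opus" column appears to just be the DeepSeek score divided by the Opus score. A quick sketch that reproduces it from the table's own (possibly disputed) numbers, taking the midpoint where a range was given:

```python
# Reproduce the ratio column: DeepSeek score / Opus score, as a percentage.
# Scores are copied from the table above and are NOT independently verified;
# ranges like ~79-80 are replaced with their midpoint.
scores = {
    "SWE-Bench Verified": (80.9, 73.1),
    "MMLU":               (91.0, 88.5),
    "GPQA":               (79.5, 75.5),
    "MATH-500":           (78.0, 90.0),
}
for bench, (opus, deepseek) in scores.items():
    print(f"{bench}: ~{deepseek / opus * 100:.0f}% of Opus")
# SWE-Bench -> ~90%, MMLU -> ~97%, GPQA -> ~95%, MATH-500 -> ~115%
```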
Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.
Lots of models get really close on benchmarks, but benchmarks only tell us how good they are at solving a defined problem. Opus is far better at solving ill-defined ones.