Ok, that's... just cheating. You can't take a benchmark like MMLU, designed to test the performance of a single general language model, and compare it to the performance of a small specialized model designed to do well on MMLU.
It wasn't designed to do well on MMLU; it's a general model designed for deterministic tasks like OCR, object detection, STT, and more, and a byproduct of that is strong language ability. It still has a transformer backbone, which gives it solid language skills while being good at other things.
See the full benchmark: https://interfaze.ai/leaderboards