Where are you getting those benchmark figures from? Math-500 should be closer to 98% for both models: https://artificialanalysis.ai/evaluations/math-500?models=de...