Just ran and scored 63 3D model generations (via code) across high and no reasoning. A 3D modeling benchmark quickly shows the spatial, logic, and code performance of a model, so I think it's a very good indicator of quality.
Here are the results compared to Gemini 3.5 Flash:
Model + config            CodeErr/gen  Cost/gen  Median time  Quality
gemini-3.5-flash, low     0.71         $0.18     68s          baseline
GLM 5.2, reasoning high   0.61         $0.18     289s         -6.0%
GLM 5.2, reasoning off    1.52         $0.10     126s         -13.6%
Although it is cheaper with reasoning off, it is significantly slower in both configs, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than gemini-3.5-flash, but when I actually look at the models they are worse.
Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3, and this is clearly the open-source SOTA model, by far.
Would you be able to run it against Gemini Flash (not Lite) 3.0, high thinking?
Very interested in this! Can you share more about the modelling method (e.g., three.js?), the task list, and outputs here?
I think there's probably some good juice to squeeze in terms of spatial awareness by doing a benchmark something like:
- give 3d modelling task
- render and snapshot from a variety of angles
- feed to third-party vision model for a "what is this" type query
- grade on end-to-end accuracy
Bonus points for asking the vision model something like "how beautiful is this 1-10".
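For what it's worth, the loop above could be sketched roughly like this. Everything here is hypothetical: `generate`, `render`, and `identify` are placeholder callables you would wire up to your own code-generating model, renderer, and vision model, and grading each view separately by substring match is just one possible choice, not something from the original benchmark.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

Vec3 = tuple[float, float, float]


def camera_positions(n_azimuth: int = 4,
                     elevations_deg: Sequence[float] = (15.0, 60.0),
                     radius: float = 3.0) -> list[Vec3]:
    """Evenly spaced viewpoints on rings around the origin (z is up)."""
    positions = []
    for elev in elevations_deg:
        phi = math.radians(elev)
        for k in range(n_azimuth):
            theta = 2 * math.pi * k / n_azimuth
            positions.append((radius * math.cos(phi) * math.cos(theta),
                              radius * math.cos(phi) * math.sin(theta),
                              radius * math.sin(phi)))
    return positions


@dataclass
class Task:
    prompt: str          # e.g. "model a wooden chair"
    expected_label: str  # e.g. "chair"


def score_task(task: Task,
               generate: Callable[[str], str],         # prompt -> modelling code (assumed)
               render: Callable[[str, Vec3], bytes],   # code + camera -> image (assumed)
               identify: Callable[[bytes], str],       # image -> "what is this" answer (assumed)
               cameras: Sequence[Vec3]) -> float:
    """Fraction of views in which the vision model names the expected object."""
    model_code = generate(task.prompt)
    hits = 0
    for cam in cameras:
        answer = identify(render(model_code, cam))
        # Naive grading: case-insensitive substring match on the label.
        if task.expected_label.lower() in answer.lower():
            hits += 1
    return hits / len(cameras)


def run_benchmark(tasks: Sequence[Task], generate, render, identify,
                  cameras: Sequence[Vec3]) -> float:
    """Mean per-task score, i.e. end-to-end accuracy across the task list."""
    scores = [score_task(t, generate, render, identify, cameras) for t in tasks]
    return sum(scores) / len(scores)
```

The "how beautiful is this 1-10" bonus would just be a second `identify`-style callable returning a number, averaged the same way.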