I gave 4.6, 4.7 and GPT 5.5 the same prompt and task: reverse engineer a collection of sample vector files from an obscure Amiga CAD program, write a detailed txt specification and a Python converter to SVG, and produce a report so I could visually verify the results.
4.6 did very well. 90% perfect on the first try, got to 100% with just a few follow-ups. 4.7 failed horribly. It first produced garbage output and claimed it was done, admitted as much when called out, proceeded to work at it a lot longer, and then IT GAVE UP. GPT 5.5 codex was shockingly good. It achieved 90% perfect on the first try in about a fourth of the time, and got to 100% faster and with fewer follow-ups.
I’m impressed.
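For anyone curious what the converter side of a task like this looks like, here's a minimal sketch. The actual Amiga format isn't specified here, so this assumes an entirely hypothetical record layout (1-byte opcode, two little-endian int16 coordinates) just to show the parse-then-emit shape such a converter tends to have:

```python
import struct

def parse_records(data: bytes):
    """Parse a hypothetical stream of drawing records.

    Assumed layout (invented for illustration): 1-byte opcode
    (0 = moveto, 1 = lineto) followed by two little-endian int16 coords.
    """
    records = []
    offset = 0
    while offset + 5 <= len(data):
        op = data[offset]
        x, y = struct.unpack_from("<hh", data, offset + 1)
        records.append((op, x, y))
        offset += 5
    return records

def to_svg(records, width=320, height=200):
    """Emit a minimal SVG document with one path built from the records."""
    cmds = [f"{'M' if op == 0 else 'L'} {x} {y}" for op, x, y in records]
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
        f'<path d="{" ".join(cmds)}" fill="none" stroke="black"/></svg>'
    )
```

The real format presumably has headers, pens, fills, layers, etc., which is exactly the stuff the models had to reverse engineer from samples.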
Interesting that 4.7 failed like that. 5.5 seems impressive, but it's oh so expensive.
Would be interesting if you ran your same test with Deepseek v4 and some of the other Chinese models.