And what do you base this on?
How does one objectively quantify how it stacks up to another model?
Or even, what is your subjective evaluation based on?
I really wonder - because I have just finished a fully vibe-coded GTK/Rust/Lua application, with me writing basically 7% of the code (all in one module) and GLM 5.1 writing the rest. We haven’t had regressions, confusion or anything else. And I am pretty damned sure I couldn’t have managed this a year ago with Claude Code and Sonnet.
What harness, if you don't mind sharing?