The harness certainly matters a lot, though GLM is pretty forgiving. I just had Opus tell me that, based on numbers from the last week covering quite a few billion tokens across half a dozen providers, GLM 5.1 has been more reliable for one of my projects than Sonnet... Just switching on 5.2 now.
How are you collecting your metrics on token usage and reliability?