"GLM-5.2 hit a problem here, because it can't read images. It isn't multimodal. So instead of looking at a screenshot, it fell back on a hacky workaround: it wrote scripts to read the raw pixel data and check whether the colors came out roughly as expected."
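For context, the pixel-checking workaround described in the quote probably looks something like the sketch below. This is only a guess at the general shape, not the script the model actually wrote: `mean_color`, `roughly_matches`, and the tolerance of 16 are all made-up names and values, and the pixels are assumed to be already decoded into plain RGB tuples.

```python
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]


def mean_color(pixels: Iterable[RGB]) -> RGB:
    """Average each channel over a region of raw RGB pixels (hypothetical helper)."""
    total_r = total_g = total_b = count = 0
    for r, g, b in pixels:
        total_r += r
        total_g += g
        total_b += b
        count += 1
    if count == 0:
        raise ValueError("region is empty")
    # Integer division keeps the result in the 0-255 range.
    return (total_r // count, total_g // count, total_b // count)


def roughly_matches(actual: RGB, expected: RGB, tolerance: int = 16) -> bool:
    """True if every channel is within `tolerance` of the expected value.

    The default tolerance is an arbitrary assumption, not a recommended value.
    """
    return all(abs(a - e) <= tolerance for a, e in zip(actual, expected))


# Example: a small region that should have rendered as nearly pure red.
region = [(250, 10, 12), (255, 0, 0), (245, 5, 8)]
average = mean_color(region)
looks_red = roughly_matches(average, (255, 0, 0))
looks_blue = roughly_matches(average, (0, 0, 255))
```

A check like this can confirm that a region is about the right color, but it says nothing about layout, text, or shapes, which is why it reads as a workaround rather than a substitute for actually looking at the screenshot.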
A better way would be to use https://github.com/openbmb/MiniCPM-V
Right, just give the text LLM access to a vision-specific agent and that problem is solved. Or, if you really want, let it call Opus with an image; it seems like you’d still save money.
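A rough sketch of what that delegation could look like, assuming the text model emits tool calls as a name plus a single string argument. Everything here is hypothetical: `ToolCall`, `make_dispatcher`, the `describe_image` tool name, and `fake_vision_backend` are made up for illustration, and the backend is a stub standing in for whatever vision model you would actually call.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolCall:
    """A tool request emitted by the text-only model (hypothetical shape)."""
    name: str
    argument: str


def make_dispatcher(tools: Dict[str, Callable[[str], str]]) -> Callable[[ToolCall], str]:
    """Build a function that routes a tool call to its registered handler."""
    def dispatch(call: ToolCall) -> str:
        handler = tools.get(call.name)
        if handler is None:
            # Return the error as text so the model can read it and recover.
            return f"error: unknown tool {call.name!r}"
        return handler(call.argument)
    return dispatch


def fake_vision_backend(image_path: str) -> str:
    """Stub: a real version would send the image to a vision model."""
    return f"(description of {image_path} would go here)"


dispatch = make_dispatcher({"describe_image": fake_vision_backend})
result = dispatch(ToolCall(name="describe_image", argument="screenshot.png"))
```

The text model only ever sees the returned string, so the vision backend can be swapped (a small local model for routine checks, a larger hosted one for hard cases) without changing anything on the text side.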