Hacker News

js4ever, today at 10:58 AM (1 reply)

"GLM-5.2 hit a problem here, because it can't read images. It isn't multimodal. So instead of looking at a screenshot, it fell back on a hacky workaround: it wrote scripts to read the raw pixel data and check whether the colors came out roughly as expected."

A better way would be to use https://github.com/openbmb/MiniCPM-V
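
For context, the pixel-check workaround described in the quote might look roughly like the sketch below. This is only an illustration, not the script GLM-5.2 actually wrote: the function names, the tolerance value, and the toy image are all assumptions, and it presumes the screenshot has already been decoded into rows of (r, g, b) tuples.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def mean_color(pixels: List[List[Pixel]], box: Tuple[int, int, int, int]) -> Pixel:
    """Average RGB over box = (left, top, right, bottom); right/bottom are exclusive."""
    left, top, right, bottom = box
    region = [pixels[y][x] for y in range(top, bottom) for x in range(left, right)]
    if not region:
        raise ValueError("empty region")
    n = len(region)
    # Integer average per channel.
    return tuple(sum(p[c] for p in region) // n for c in range(3))


def roughly_matches(actual: Pixel, expected: Pixel, tolerance: int = 24) -> bool:
    """True if every channel is within `tolerance` of the expected value (assumed threshold)."""
    return all(abs(a - e) <= tolerance for a, e in zip(actual, expected))


# Toy 4x4 "screenshot": left half reddish, right half bluish.
image = [[(250, 10, 5)] * 2 + [(0, 20, 240)] * 2 for _ in range(4)]

left_ok = roughly_matches(mean_color(image, (0, 0, 2, 4)), (255, 0, 0))
right_ok = roughly_matches(mean_color(image, (2, 0, 4, 4)), (255, 0, 0))
```

A check like this can confirm that a region is roughly the right color, but it says nothing about layout, text, or shapes, which is why a vision model is the more natural fit.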


Replies

twobitshifter, today at 11:10 AM

Right, just give the text LLM access to a vision-specific agent and that problem can be solved. Or, if you really want, let it even call Opus with an image; it seems like you'd still save money.
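
A minimal sketch of that delegation idea, assuming the text model emits tool calls as plain dictionaries. The tool name `inspect_image`, the call shape, and the `VisionBackend` signature are all hypothetical, and `fake_vision` stands in for whatever vision model would actually be called (MiniCPM-V, Opus, or anything else):

```python
from typing import Callable, Dict

# Hypothetical backend signature: (image path, question) -> text answer.
VisionBackend = Callable[[str, str], str]


def make_tool_dispatcher(vision: VisionBackend) -> Callable[[Dict[str, str]], str]:
    """Return a function that executes tool calls emitted by the text-only model."""

    def dispatch(call: Dict[str, str]) -> str:
        # Only one tool is exposed in this sketch; anything else is rejected.
        if call.get("tool") != "inspect_image":
            return f"error: unknown tool {call.get('tool')!r}"
        # Hand the image question to the vision backend and return its text.
        return vision(call["path"], call["question"])

    return dispatch


def fake_vision(path: str, question: str) -> str:
    # Stand-in for a real vision model call; returns a placeholder string.
    return f"[description of {path} for: {question}]"


dispatch = make_tool_dispatcher(fake_vision)
reply = dispatch(
    {"tool": "inspect_image", "path": "shot.png", "question": "Is the button blue?"}
)
```

The text model only ever sees the returned string, so the expensive multimodal call happens once per screenshot rather than on every turn.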