ricardobeat • yesterday at 8:00 PM • 1 reply

That doesn’t work well in a lot of scenarios. The text LLM doesn’t know what to look for in an image before it sees a description, so you might need multiple rounds of back-and-forth.

Replies

jarjoura • yesterday at 8:30 PM

Vision decoding outside of the model's latent space is lossy, but Claude Opus's vision isn't that great outside of UI screenshots. I mean, it works in a pinch. At least in my testing, if you're looking at non-UI images, there are better image-to-text models that can turn them into very precise documents that any LLM can easily parse.
