embedding-shape, today at 10:54 AM

> Images are tokenized

Exactly, and this is where the fidelity of an image gets lost. They don't "see" visually; they get a representation of the image via tokens, which is why I said they don't see but basically "see an explanation of the image". I don't mean something like a caption, but in the end they work internally with tokens, not pixels or actual images.

Example from Grok and Claude, with a very simple test case. I made a white image with 7 red dots and asked Claude and Grok to count the red dots. The filename is "8-red-dots.png", but the image actually only has 7 dots.

Because they don't receive the image itself but "tokenized images", as you say, they don't seem to be able to actually see the number of red dots. ChatGPT correctly identified that there are only 7 dots, but seemingly only because it ended up using Python to actually count the pixels.
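For anyone wondering what "using Python to count the pixels" could look like, here is a rough sketch. This is not the code ChatGPT actually ran (I can't see that); it is just one plausible approach, a flood fill over connected red regions. The helper names (`is_red`, `count_red_dots`, `make_test_image`), the redness threshold, and the synthetic stand-in image are all my own assumptions, not the original "8-red-dots.png".

```python
from collections import deque

WHITE = (255, 255, 255)
RED = (255, 0, 0)


def is_red(pixel, threshold=128):
    """Crude redness check (assumed threshold): strong red, weak green and blue."""
    r, g, b = pixel
    return r >= threshold and g < threshold and b < threshold


def count_red_dots(pixels):
    """Count 4-connected regions of red pixels in a 2D grid of (r, g, b) tuples."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    dots = 0
    for y in range(height):
        for x in range(width):
            if seen[y][x] or not is_red(pixels[y][x]):
                continue
            # Unvisited red pixel: a new dot. Flood-fill to mark the whole region.
            dots += 1
            seen[y][x] = True
            queue = deque([(x, y)])
            while queue:
                cx, cy = queue.popleft()
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and is_red(pixels[ny][nx])):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
    return dots


def make_test_image(centers, size=(80, 20), radius=3):
    """Build a synthetic white image with a filled red disk at each center."""
    width, height = size
    img = [[WHITE] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    img[y][x] = RED
    return img


# Seven well-separated dots, mirroring the test case (the filename would claim 8).
centers = [(5 + 10 * i, 10) for i in range(7)]
image = make_test_image(centers)
print(count_red_dots(image))  # prints 7
```

With a real file you would first need to load the pixels into that grid, for example with Pillow, and anti-aliased dot edges would probably need a more forgiving threshold. The point is only that this works on actual pixels, so the count is exact regardless of what the filename says.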

Original image + what the various LLMs responded: https://imgur.com/a/vh1tU6Y

Again, a very simple (and dumb) test, and I won't claim this is science, but once you start trying to use these vision models for precise and exact UI and UX work, you'll notice over and over how poor their fidelity and spatial awareness actually are when it comes to images.