ewired • today at 12:16 AM • 1 reply

It was interesting to find out that Qwen 2.5 VL can output coordinates like Sonnet 4. Or do the two use different implementations?

Replies

anerli • today at 12:38 AM

Both of them are "visually grounded", meaning that if you ask for the location of something in an image, they can output the exact x/y pixel coordinates! Not many models can do this, and even fewer are large enough to actually reason well through sequences of actions.
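
To make "output the x/y pixel coordinates" a bit more concrete, here is a rough sketch of how a caller might consume that kind of answer. It assumes the model replies with a JSON object containing a `point_2d` field, and that the coordinates refer to the (possibly resized) image the model was given. The reply string, the field name, and the helpers `parse_point` and `to_original_pixels` are all hypothetical, used for illustration only; this is not either model's documented API.

```python
import json


def parse_point(reply: str) -> tuple[float, float]:
    """Pull an (x, y) point out of a JSON reply (assumed, hypothetical format)."""
    data = json.loads(reply)
    x, y = data["point_2d"]
    return float(x), float(y)


def to_original_pixels(point, model_size, original_size):
    """Rescale a point from the image size the model saw to the original size."""
    (x, y), (mw, mh), (ow, oh) = point, model_size, original_size
    return round(x * ow / mw), round(y * oh / mh)


# Hypothetical model reply to "where is the submit button?"
reply = '{"point_2d": [320, 240], "label": "submit button"}'

point = parse_point(reply)
# Assume the screenshot was downscaled from 1280x960 to 640x480 before sending.
click = to_original_pixels(point, model_size=(640, 480), original_size=(1280, 960))
print(click)  # (640, 480)
```

If the image is sent at its original resolution, the rescaling step is a no-op and the parsed point can be used directly.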
