Both of them are "visually grounded" - meaning if you ask for the location of something in an image - they can output the exact x/y pixel coordinates! Not many models can do this, especially not many that are large enough to actually reason through sequences of actions well
Both of them are "visually grounded" - meaning if you ask for the location of something in an image - they can output the exact x/y pixel coordinates! Not many models can do this, especially not many that are large enough to actually reason through sequences of actions well