The article says this misses important details, e.g. data that might be in the image.
Very bad take. With most modern multimodal models you get way better performance than going to text first.