very bad take. with most modern multimodal models you get way better performance than going to text first
it's a cost/latency trade-off in production + very use-case dependent