Hacker News

383toast | yesterday at 11:20 PM | 2 replies

why not a multimodal embedding model?


Replies

efavdb | today at 1:16 AM

The article says this misses important details, e.g. data that might be in the image.

sateesh | today at 12:17 AM

The article does mention why they don't use multimodal retrieval. Also, I think this approach is cheaper (compute-wise) than multimodal retrieval. From the article:

  Multimodal retrieval does not suit this domain. CLIP-style embeddings wash out exactly the fine detail that matters in charts, tables, and annotated screenshots, and short technical queries ("how do I configure X") give too little signal to match against image vectors