How we index images for RAG

191 points • by mooreds • yesterday at 4:13 PM • 26 comments

Comments

With media ingestion this is called "eager" processing. Historically it has been used for things like pulling thumbnails from images and video and pre-generating common sizes. This follows the same pattern and makes all the sense in the world. My only concern is that, due to the non-deterministic nature of LLMs, new models will reveal new information about your data.

For example, you might identify a car in an image, but the context is the car running a red light. A new model might pick that up while an old one doesn't. These context adjustments might sometimes require you to rerun your LLM processing, or to keep a one-to-many relationship between an image and multiple runs so you can take the best result or combine them.
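
A minimal sketch of that one-to-many idea, assuming each image keeps every description it has ever received. The class names, field names, and model identifiers (`vision-v1`, `vision-v2`) are hypothetical and not taken from the article:

```python
from dataclasses import dataclass, field


@dataclass
class DescriptionRun:
    model: str  # hypothetical model identifier
    text: str   # description that model produced


@dataclass
class ImageRecord:
    image_id: str
    runs: list[DescriptionRun] = field(default_factory=list)

    def add_run(self, model: str, text: str) -> None:
        """Append a new description instead of overwriting older ones."""
        self.runs.append(DescriptionRun(model, text))

    def combined_text(self) -> str:
        """Join distinct descriptions, newest first, for text indexing."""
        seen: set[str] = set()
        parts: list[str] = []
        for run in reversed(self.runs):
            if run.text not in seen:
                seen.add(run.text)
                parts.append(run.text)
        return "\n".join(parts)


record = ImageRecord("img-001")
record.add_run("vision-v1", "A car at an intersection.")
record.add_run("vision-v2", "A car running a red light at an intersection.")
print(record.combined_text())
```

Keeping old runs around costs storage, but it lets you compare models or fall back if a newer description turns out worse.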

Actual usage will also reveal the most commonly used assets, so you can target the ones that are most trafficked and save a ton on processing that way.
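
As a rough illustration, assuming you already log which assets get retrieved, picking the re-processing candidates could look like this. The function name, the log format, and the `budget` parameter are assumptions for the example:

```python
from collections import Counter


def pick_assets_to_reprocess(access_log: list[str], budget: int) -> list[str]:
    """Return the `budget` most frequently accessed asset ids."""
    counts = Counter(access_log)
    return [asset for asset, _ in counts.most_common(budget)]


# Hypothetical access log: one entry per retrieval of an asset.
log = ["a.png", "b.png", "a.png", "c.png", "a.png", "b.png"]
top = pick_assets_to_reprocess(log, budget=2)
print(top)
```

Only the assets in `top` would be sent through the newer model; the rest keep their existing descriptions.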

bad_username • yesterday at 8:13 PM

> we don't send images to the model at query time. We describe each image once, at indexing time, with a cheap vision model, store the descriptions as text, and retrieve them alongside ordinary text chunks

This is what I've been doing in my Obsidian infodump for a while. If I know that an image is important, I generate a text description (Mermaid if possible, English if not) and paste it after the image in a block. This lets agents "see" the image even if they can't actually see it. Though my process is manual, the improvement in outcomes for agents that rely on text search/retrieval is very real and worth it.
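
A minimal sketch of the quoted pattern, under loud assumptions: `describe_image` is a hypothetical stand-in for the vision-model call (here it returns canned text), and retrieval is a toy keyword overlap rather than whatever the article actually uses:

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    kind: str  # "text" or "image_description"


def describe_image(path: str) -> str:
    """Hypothetical stand-in for a cheap vision model, called once at indexing time."""
    canned = {
        "diagram.png": "Flowchart: user request goes to cache, then database on miss.",
    }
    return canned.get(path, "No description available.")


def build_index(text_chunks: list[str], image_paths: list[str]) -> list[Chunk]:
    """Store image descriptions as plain text next to ordinary text chunks."""
    index = [Chunk(f"t{i}", t, "text") for i, t in enumerate(text_chunks)]
    index += [
        Chunk(f"i{i}", describe_image(p), "image_description")
        for i, p in enumerate(image_paths)
    ]
    return index


def search(index: list[Chunk], query: str, top_k: int = 3) -> list[Chunk]:
    """Toy retrieval: rank chunks by how many query words they contain."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for chunk in index:
        words = set(re.findall(r"\w+", chunk.text.lower()))
        score = len(terms & words)
        if score:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


index = build_index(["The service is written in Go."], ["diagram.png"])
hits = search(index, "cache miss database")
print([(h.chunk_id, h.kind) for h in hits])
```

No image is touched at query time; the only thing searched is text, which is the point of the approach.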

m4rkuskk • today at 6:48 AM

It's an ad for their product. There is nothing special about this approach, and it's probably done the same way by everyone else.

furyman • today at 12:09 AM

Well, I don't know if others have been doing this too, but I have been doing it for two years and it works really well. The one exception: for the documents I had to chunk that contained these images, I had to chase the authors (multiple of them) to update the relevant captions for their images. It is more cost-efficient than multimodal, and ingestion time is lower overall. The only catch is that if the retrieval query is a question that can only be answered by looking at the image, this architecture would need some modification.

fhouser • yesterday at 9:55 PM

That's smart. Just the other day, I was thinking about how I would solve images/graphs/rich PDF stuff in a RAG system. Now I know more, thanks!

vessenes • today at 12:12 PM

"This is what makes the load-bearing case work,"

Man, I hate that AI writing tic. I appreciate the instinct to share the workflow. It's still very difficult to get AI to put an info-dense description together, though; we tend to get long and vague ones.

relevant_stats • today at 9:17 AM

Seriously?

- Marketing material? check

- Bloated to the extreme? check

- "Get a free trial" at the end? check

- Entirely LLM generated? check

EGreg • yesterday at 10:01 PM

We have it in our open-source framework, in case anyone wants to deploy it:

https://github.com/Qbix/AI/blob/6753f6e453908682401f49760002...

https://github.com/Qbix/AI/blob/main/config/observations.jso...

wrote it up here a few months ago: https://community.safebots.ai/t/building-cultural-infrastruc...

383toast • yesterday at 11:20 PM

why not a multimodal embedding model?

iot_devs • today at 1:33 AM

How descriptive is the caption that you obtain?

Do you include colour, shapes, etc.?

hparadiz • yesterday at 8:09 PM

That cookie popup just makes me wanna leave and never come back

hbwang2076 • today at 2:04 AM

The visual chunking idea works. But what about images that mix text and graphics? CLIP recognizes style, not structural intent.
