I don't know whether others have been doing this too, but I have been using this approach for the past two years and it works really well. The one catch was that, for the documents containing images that I had to chunk, I had to chase the authors (several of them) to update the captions for their images. It is more cost-efficient than a multi-modal pipeline, and ingestion time is lower overall. The only limitation is that if the retrieval query is a question that can be answered only by looking at the image itself, this architecture would need some modification.
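
For anyone wondering what this looks like in practice, here is a minimal sketch of the idea, not my actual pipeline. It assumes images are embedded with Markdown-style `![caption](path)` markup, and it uses plain keyword overlap as a stand-in for whatever text embedding and vector store you would really use. The names `Chunk`, `chunk_with_captions`, and `retrieve` are made up for illustration.

```python
import re
from dataclasses import dataclass, field

# Assumption: images appear in the source as Markdown-style ![caption](path).
IMAGE_PATTERN = re.compile(r"!\[(?P<caption>[^\]]*)\]\((?P<path>[^)]+)\)")


@dataclass
class Chunk:
    text: str                                         # text that gets indexed
    image_paths: list = field(default_factory=list)   # images to show alongside


def chunk_with_captions(document: str) -> list:
    """Split on blank lines; replace each image with its caption text."""
    chunks = []
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Remember which images belong to this chunk.
        paths = [m.group("path") for m in IMAGE_PATTERN.finditer(paragraph)]
        # Index the caption as ordinary text; no image model is involved.
        text = IMAGE_PATTERN.sub(
            lambda m: f"[Image: {m.group('caption')}]", paragraph
        )
        chunks.append(Chunk(text=text, image_paths=paths))
    return chunks


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list, top_k: int = 1) -> list:
    """Toy scorer: rank chunks by number of words shared with the query.

    A real setup would swap this for text embeddings and a vector store.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        chunks,
        key=lambda c: len(query_tokens & tokenize(c.text)),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    doc = (
        "Our quarterly revenue grew steadily.\n\n"
        "![Network topology diagram of the staging cluster](img/topology.png)\n\n"
        "Deployment takes ten minutes."
    )
    for hit in retrieve("topology diagram", chunk_with_captions(doc)):
        print(hit.text, hit.image_paths)
```

Because only the caption text is indexed, retrieval quality depends entirely on how good the captions are, which is why the authors had to be chased. It also shows the limitation: a question about a detail that is visible in the image but missing from the caption will not match anything.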