Not OP, but we use the docling library to extract the text and convert it to Markdown before storing it for use with an LLM.
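
For anyone wanting to try the same flow, here is a minimal sketch. It assumes docling's `DocumentConverter` with `export_to_markdown()` on the converted document. The `store_markdown` helper, the `markdown_store` directory, and `report.pdf` are hypothetical names made up for illustration, not necessarily the setup described above.

```python
from pathlib import Path


def extract_markdown(source: str) -> str:
    """Convert a document (local path or URL) to Markdown using docling."""
    # Imported lazily so the storage helper below can be used without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


def store_markdown(markdown: str, source: str, out_dir: str = "markdown_store") -> Path:
    """Write the Markdown to <out_dir>/<source stem>.md (hypothetical storage layout)."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (Path(source).stem + ".md")
    target.write_text(markdown, encoding="utf-8")
    return target


if __name__ == "__main__":
    source = "report.pdf"  # hypothetical input file
    saved = store_markdown(extract_markdown(source), source)
    print(f"Stored Markdown at {saved}")
```

Plain `.md` files on disk are just the simplest possible store. The same string could go into a database or a chunking/embedding step before it reaches the LLM.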