Hacker News

trees101 today at 3:43 AM (5 replies)

What is a good way to read PDFs using AI?


Replies

seanhunter today at 5:14 AM

In my experience it really depends on what sort of PDFs you are trying to extract (i.e. what the content is).

Regular PDFs that have been produced in a “normal way” (i.e. using LaTeX or a modern application with a “save to PDF” function) will contain the text, and for those I’ve had a lot of success using pypdf.
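As a rough illustration of the pypdf route, here is a minimal sketch. The helper names and the `min_chars` threshold are assumptions made for this example, not part of pypdf; the only pypdf calls used are `PdfReader`, `reader.pages` and `page.extract_text()`. Pages that come back nearly empty are flagged as likely scans that need OCR instead.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page using pypdf."""
    # Imported lazily so the helper below also works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Pages without a text layer typically yield an empty string.
    return [page.extract_text() or "" for page in reader.pages]


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Zero-based indices of pages with too little text to trust.

    min_chars is an arbitrary, hypothetical cutoff; tune it for your documents.
    """
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]
```

Something like `pages_needing_ocr(extract_page_texts("paper.pdf"))` would then give the list of pages to send to an OCR step.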

“Image” PDFs that have been produced via a scan, and so don’t actually contain a text transcript, require actual OCR. At the moment my personal RAG pipeline is doing this using a local Gemma4 model (you could use something else).

Either way, I do an audit post-ingest: I select a random set of pages, get the local Gemma model to try those same pages, and compare the results. The symptoms to look out for will depend a lot on what you’re trying to extract; I’m extracting maths mostly, so I get the model to check the extraction of symbols, equations, etc. One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding), as this almost always catches PDFs that have extracted as pure garbage.

I added this step because I was ingesting a lot of old maths PDFs with specialist notation that wasn’t always getting ingested correctly, and as they were image PDFs it was coming in as pure garbage. The fix here is to use a specialist OCR service (I have been using “Mathpix”, which has been great and isn’t too expensive if you don’t want to do too much).
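A mojibake check of this kind could be sketched roughly as below. This is not the actual audit code from the pipeline described above: the marker list, the scoring rule and the function names are assumptions for illustration, and the markers only cover the common case of UTF-8 bytes decoded as Latin-1/CP1252.

```python
import random
import unicodedata
from typing import List, Optional

# Sequences that commonly appear when UTF-8 is decoded as Latin-1/CP1252,
# plus the Unicode replacement character. Heuristic only: "Ã" and "Â" are
# legitimate letters in some languages, so expect false positives there.
MOJIBAKE_MARKERS = ("Ã", "Â", "â€", "\ufffd")


def mojibake_score(text: str) -> float:
    """Rough share of the text that looks like decoding debris (0.0 = clean)."""
    if not text:
        return 0.0
    suspicious = sum(text.count(marker) for marker in MOJIBAKE_MARKERS)
    # Control, private-use and unassigned code points rarely belong in prose.
    suspicious += sum(
        1
        for ch in text
        if ch not in "\n\r\t" and unicodedata.category(ch) in ("Cc", "Co", "Cn")
    )
    return suspicious / len(text)


def pick_audit_pages(num_pages: int, sample_size: int, seed: Optional[int] = None) -> List[int]:
    """Choose a random, sorted set of zero-based page indices to audit."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_pages), min(sample_size, num_pages)))
```

Pages picked by `pick_audit_pages` whose score is well above zero are candidates for re-extraction with OCR; what counts as "well above" is something to calibrate on your own documents.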

The other thing that can cause problems is tables (e.g. if you were trying to ingest a lot of PDFs like company financials). Those can trip up both the OCR and the pure text extraction methods. I don’t have a current recommendation for that because I haven’t done it recently enough and the state of the art has moved a lot. It’s something to be aware of that will require special treatment, though.

lostsock today at 4:02 AM

I have a standing instruction that any document a given AI can't read natively should first be converted into .md using https://github.com/microsoft/markitdown, which I've found to work really well.
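For reference, a minimal sketch of that conversion step using markitdown's Python API (`MarkItDown().convert(...)` and the result's `text_content`). The `NATIVE_SUFFIXES` set and both helper names are hypothetical, chosen only to illustrate the "convert if not natively readable" rule.

```python
from pathlib import Path

# Hypothetical: file types the target model is assumed to read directly.
NATIVE_SUFFIXES = {".md", ".txt"}


def needs_conversion(path: str) -> bool:
    """True if the file should be converted to Markdown before use."""
    return Path(path).suffix.lower() not in NATIVE_SUFFIXES


def to_markdown(path: str) -> str:
    """Convert a document to Markdown text with markitdown."""
    # Imported lazily so needs_conversion() works without markitdown installed.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(path)
    return result.text_content
```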

wwn_se today at 7:25 AM

Preprocessing with a PDF extraction and OCR tool and then feeding the result to the big model is usually way more stable.

chrsw today at 11:27 AM

In the broadest sense, I don't think we're there yet. I asked an SoC vendor to provide their chip documentation in Markdown. They refused. So, I went ahead and tried to do it myself with AI.

I tried various AI tools and the results ranged from absolute garbage to something-but-not-quite.

I went ahead and did a section of a huge PDF by hand, just to see if what I was asking for was even feasible. After more than several hours of painstaking work spread across multiple days, I got several chapters to look identical to the source PDF in some Markdown renderers. I had to use some HTML for the more complex tables. I converted some diagrams to Markdown and some to images linked to from the Markdown.

rawoke083600 today at 8:41 AM

MinerU works well to get it into Markdown.