Hacker News

coder543 · yesterday at 3:44 PM · 2 replies

If you want OCR from the big LLM providers, you should probably be passing one page per request. Having the model focus on OCR for only a single page at a time seemed to help a lot in my anecdotal testing a few months ago. You can even send all the pages in parallel as separate requests, and get the higher-quality responses much faster, too.
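A minimal sketch of that fan-out pattern in Python, assuming a hypothetical `ocr_page` wrapper around whatever LLM provider you use (the placeholder body here just returns a dummy string):

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_bytes: bytes, page_number: int) -> str:
    # Placeholder: in practice this would send a single page image
    # to your LLM provider with an OCR-focused prompt and return
    # the transcribed text for that page.
    return f"<text of page {page_number}>"


def ocr_document(pages: list[bytes], max_workers: int = 8) -> list[str]:
    """OCR each page in its own request, in parallel, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order even though the
        # requests complete out of order.
        return list(pool.map(ocr_page, pages, range(1, len(pages) + 1)))
```

One request per page keeps the model's context small, and `pool.map` preserves page order regardless of which requests finish first.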

But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.


Replies

staticman2 · yesterday at 5:26 PM

Gemini Pro 3 seems to be built for handling multi-page PDFs.

I can feed it a multi-page PDF, tell it to convert it to markdown, and it does this well. I don't need to load the pages one at a time, as long as I use the PDF format. (This was tested in AI Studio, but I think the API works the same way.)

show 1 reply
HPsquared · yesterday at 4:24 PM

You could maybe then do a second pass over the whole text (treating it as plain text, not OCR) to look for likely mistakes.
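A minimal sketch of that second pass, with the completion call injected as a parameter so you can plug in any LLM client (the prompt wording here is just an illustration, not a tested one):

```python
from typing import Callable


def proofread(ocr_text: str, complete: Callable[[str], str]) -> str:
    """Second pass: ask the model to flag likely OCR errors in plain text.

    `complete` is a stand-in for your LLM client's completion call,
    e.g. a function that takes a prompt string and returns the reply.
    """
    prompt = (
        "The following text was produced by OCR. Point out any words or "
        "numbers that look like likely recognition errors and suggest "
        "corrections:\n\n" + ocr_text
    )
    return complete(prompt)
```

Because the model only sees plain text at this stage, it can apply language-level judgment (spotting implausible words or digits) without re-reading the page images.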

show 1 reply