Hacker News

alaanor · yesterday at 4:31 PM

There have been so many OCR models released in the past few months, all VLMs, and yet none of them handle Korean well. Every time I try with a random screenshot (not an A4 document) they just fail at a "simple" task. And funnily enough, Qwen3 8B VL is the best model and usually gets it right (although I couldn't get the bboxes quite right). Even funnier, whatever runs locally on an iPhone's CPU is insanely good, and the same goes for Google's OCR API. I don't know why we don't get more of the traditional OCR stuff; PaddlePaddle v5 is the closest I could find. At this point, I feel like I might be doing something wrong with those VLMs.
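For what it's worth, this is roughly the kind of call I mean — a minimal sketch against an OpenAI-compatible endpoint (e.g. vLLM serving a Qwen3-VL 8B checkpoint locally); the base URL, model ID, prompt wording, and the expectation that the model returns well-formed JSON are all assumptions, not anything the model is guaranteed to honor:

  # Minimal sketch: ask a locally served VLM (e.g. Qwen3-VL 8B behind an
  # OpenAI-compatible server such as vLLM) to transcribe a Korean screenshot
  # and return bounding boxes. The base_url, model name, and JSON output
  # format are hypothetical placeholders.
  import base64
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

  with open("screenshot.png", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode()

  response = client.chat.completions.create(
      model="Qwen/Qwen3-VL-8B-Instruct",  # hypothetical model ID
      messages=[{
          "role": "user",
          "content": [
              {"type": "image_url",
               "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
              {"type": "text",
               "text": "Extract all Korean text in this screenshot. "
                       "Return JSON: [{\"text\": ..., \"bbox\": [x1, y1, x2, y2]}]."},
          ],
      }],
      temperature=0,
  )

  print(response.choices[0].message.content)  # may or may not be valid JSON

Even at temperature 0, the transcription is the part that sometimes works; the bbox coordinates are what I can't get to come out right.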


Replies

Stagnant · yesterday at 5:21 PM

Chrome ships a local OCR model for extracting text from PDFs that is better than any of the VLM or open-source OCR models I've tried. I had a few hundred gigs of old newspaper scans, and after trying all the other options I ended up building a wrapper around the DLL Chrome uses to get the text and bboxes. Performance and accuracy are on another level compared to tesseract, and while the VLMs sometimes produced good results, they just seemed unreliable.
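For reference, the kind of text-plus-bbox output being compared here is easy to get from tesseract itself — a minimal pytesseract sketch (the language pack, page-segmentation mode, and filename are assumptions, and old newspaper scans will usually need their own preprocessing):

  # Tesseract baseline for text + bounding boxes via pytesseract.
  # Assumes tesseract is installed with the relevant language pack;
  # the filename and --psm setting are placeholders.
  import pytesseract
  from pytesseract import Output
  from PIL import Image

  img = Image.open("newspaper_scan.png")
  data = pytesseract.image_to_data(img, lang="eng", config="--psm 3",
                                   output_type=Output.DICT)

  for text, conf, x, y, w, h in zip(data["text"], data["conf"],
                                    data["left"], data["top"],
                                    data["width"], data["height"]):
      if text.strip() and float(conf) > 0:
          print(f"{text!r} at bbox ({x}, {y}, {x + w}, {y + h}) conf={conf}")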

I've thought of open-sourcing the wrapper but haven't gotten around to it yet. I bet Claude Code could build a functioning prototype if you just point it at the "screen_ai" dir under Chrome's user data.

ghrl · yesterday at 4:42 PM

I remember someone building a meme search engine for millions of images using a cluster of used iPhone SEs because of Apple's very good and fast OCR capabilities. Quite an interesting read as well: https://news.ycombinator.com/item?id=34315782
