logoalt Hacker News

daveguyyesterday at 3:38 PM1 replyview on HN

If your needs are that sensitive, I doubt you'll find anything anytime soon that doesn't require a human in the loop. Even SOTA models only average 95% accuracy on messy inputs. If that's a per character accuracy (which OCR is generally measured by), that's going to be 5+ errors per page of 100+ words. If you really can't afford mistakes you have to consider the OCR inaccurate. If you have key components like "days to respond" and "units vacant" you need to identify the presence of those specifically with bias in favor of false positives (over false negatives), and human confirmation of the source-> OCR.


Replies

kergonathyesterday at 5:04 PM

> If you really can't afford mistakes you have to consider the OCR inaccurate.

Isn’t this close to the error rate of human transcription for messy input, though? I seem to remember a figure in that ballpark. I think if your use case is this sensitive, then any transcription is suspicious.

show 1 reply