Hacker News

dperfect, yesterday at 6:06 PM

Letting Claude work a little longer produced this behemoth of a script (which is supposed to be somewhat universal in correcting similar OCR'd PDFs - not yet tested on any others though): https://pastebin.com/PsaFhSP1

which uses this Rust zlib stream fixer: https://pastebin.com/iy69HWXC

and gives the best output I've seen it produce: https://imgur.com/itYWblh

This is using the same OCR'd text posted by commenter Joe.


Replies

daveguy, yesterday at 6:58 PM

> which is supposed to be somewhat universal in correcting similar OCR'd PDFs

Xerox would like a word.

https://news.ycombinator.com/item?id=29223815

Point being, "correcting" text until it merely looks correct may be worse than just accepting the errors. Raw OCR errors are often easy for humans to spot because they show up as nonsense words. "Correcting" OCR output can produce plausible but wrong results that are much harder for the human in the loop to identify.
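
A minimal sketch of that trade-off, as a toy illustration only (it is not taken from the linked script, and the vocabulary and the function names `autocorrect` and `flag` are made up for this example). One function silently snaps an unknown token to the nearest dictionary word; the other leaves the token alone and marks it for review:

```python
import difflib

# Hypothetical toy vocabulary; a real word list would be far larger.
VOCAB = ["invoice", "total", "amount", "due", "form", "from", "farm"]

def autocorrect(token, vocab=VOCAB):
    """Silently replace an unknown token with its closest vocabulary word."""
    if token in vocab:
        return token
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=0.6)
    # Fall back to the original token when nothing is similar enough.
    return matches[0] if matches else token

def flag(token, vocab=VOCAB):
    """Keep the token as scanned, but mark it when it is not in the vocabulary."""
    return token if token in vocab else f"[?{token}?]"

ocr_tokens = ["total", "amount", "due", "fcrm"]

# "fcrm" is equally close to "form", "from" and "farm". Auto-correction
# picks one of them, and the result reads like a perfectly normal word.
print([autocorrect(t) for t in ocr_tokens])

# Flagging keeps the nonsense word visible, so a reader knows to check it.
print([flag(t) for t in ocr_tokens])
```

With flagging, the damaged word stays obviously damaged. With auto-correction, the output contains only real words, and nothing tells the reader which one was a guess.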
