Hacker News

dperfect, yesterday at 8:04 PM

That's true if we're correcting OCR of actual output text. In this case, it's operating on the base64 text, trying to produce chunks that decode to valid zlib streams and PDF syntax, so that the file is intact enough to be opened. "Just accepting errors" would mean not seeing any content in the file, because it could not be read at all.

So yes, the "fixed" output has errors, but it’s not hallucinating details like an LLM, nor is it trying to produce output that conforms to any linguistic or stylistic heuristics.
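
To make the distinction concrete, here is a rough sketch of that kind of validity-driven repair. It is not the actual tool: the `CONFUSABLE` table, the `repair_chunk` name, and the `max_ambiguous` limit are all made up for illustration, and it assumes the simplest possible case, where one chunk of base64 text decodes to exactly one complete zlib stream. It swaps look-alike characters and keeps the first candidate that `zlib.decompress` accepts, relying on the Adler-32 check at the end of a zlib stream to reject most wrong guesses.

```python
import base64
import itertools
import zlib

# Hypothetical table of glyphs an OCR pass might confuse (an assumption,
# not taken from any real tool). Each entry lists the original character first.
CONFUSABLE = {"l": "l1I", "1": "1lI", "I": "Il1", "O": "O0", "0": "0O"}


def repair_chunk(b64_text, max_ambiguous=8):
    """Try look-alike substitutions until the chunk inflates cleanly.

    Returns (candidate_text, decompressed_bytes) for the first candidate
    that base64-decodes and decompresses without error, or None.
    Assumes the chunk is the base64 encoding of one whole zlib stream.
    """
    positions = [i for i, ch in enumerate(b64_text) if ch in CONFUSABLE]
    if len(positions) > max_ambiguous:
        return None  # search space grows exponentially; give up early

    choices = [CONFUSABLE[b64_text[i]] for i in positions]
    for combo in itertools.product(*choices):
        chars = list(b64_text)
        for i, ch in zip(positions, combo):
            chars[i] = ch
        candidate = "".join(chars)
        try:
            raw = base64.b64decode(candidate, validate=True)
            # Raises zlib.error on a malformed stream or a checksum mismatch.
            return candidate, zlib.decompress(raw)
        except (ValueError, zlib.error):
            continue
    return None
```

The only acceptance test here is structural: a candidate either inflates or it doesn't. Nothing in the search looks at whether the decompressed content reads plausibly, which is the difference from an LLM-style correction.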

The phrase "correcting similar OCR'd PDFs" should have been "correcting similar OCR'd base64 representations of PDFs".