
lxgr yesterday at 10:43 AM

> regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text

Producing "normal PDFs" that way actually requires specific LaTeX options to be enabled, in my experience. Without them, PDF viewers have to perform all kinds of ugly hacks to even figure out which Unicode codepoint a given glyph is supposed to represent! PDF is much more of a vector graphics format than a text layout format, more so than most people seem to realize.

> One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding)

This is exactly the problem with PDFs: it's not regular mojibake (i.e. interpreting a string of text in the wrong charset), but rather some PDF processor's failed attempt at mapping glyphs back to codepoints when the PDF contains no explicit mapping table. Including that table is something the PDF's creator actively has to do.
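To make that concrete, here is a toy sketch in Python. It is not a real PDF parser: the glyph codes, the `to_unicode` table and the `extract` helper are all made up for illustration, and the fallback guess is deliberately arbitrary. The point is only that the page stores glyph codes, and the text you get back depends on whether a mapping table came with them.

```python
# Toy model of PDF text extraction (hypothetical data, not a real parser).
# A page's content stream stores glyph codes from a font, not Unicode text.
glyph_codes = [1, 2, 3, 3, 4]  # made-up codes in a subsetted font

# Made-up stand-in for the glyph-to-codepoint table a PDF producer may embed.
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}


def extract(codes, cmap=None):
    """Turn glyph codes into text, guessing when no mapping table exists."""
    if cmap is not None:
        # With an explicit table the lookup is unambiguous.
        return "".join(cmap.get(code, "\ufffd") for code in codes)
    # Without a table the extractor has to guess. This guess (offset into
    # ASCII) is arbitrary and stands in for whatever heuristic a tool uses.
    return "".join(chr(0x40 + code) for code in codes)


print(extract(glyph_codes, to_unicode))  # uses the embedded table
print(extract(glyph_codes))              # falls back to guessing
```

The second call returns something that looks like mojibake, but no charset was ever misread: the extractor simply had nothing to map the glyph codes with.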

> “Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR.

For the reason above and others, in my experience, OCR actually works significantly better than trying to "semantically parse" the PDF.


Replies

seanhunter yesterday at 2:22 PM

Hmm. Not sure what I'm doing that's special, but both the LaTeX PDFs I produce and others that I read generally work just fine with pypdf, and I'm really not adding any flags at all. My makefile just runs

   latexmk --lualatex -aux-directory=output -output-directory=output $<

Maybe latexmk is adding some magic?