DjVu and its connection to Deep Learning (2023)

71 points • by tosh • today at 9:05 AM • 17 comments

Comments

As I understand it, the technology was protected by a patent held by the guys at Leptonica, and it has expired. There is a crude project for encoding images to JBIG2 at https://github.com/agl/jbig2enc. I am sharing my personal scripts here [1] (Windows) that wrap it for end-to-end DjVu-to-PDF conversion of scanned texts, using JBIG2-compressed images in the PDF instead of JPEG. This combines decent compression with the handiness of PDF. DjVu still compresses better, but the PDFs can be kept under twice the size. That does not sound impressive, but many commonly available pipelines produce sizes 3x, 4x and worse; those using the Ghostscript pdfwrite device are particular offenders. The scripts have worked for months locally but are provided "as is", without testing and with zero support: you deal with the Python dependencies and with having the jbig2 and DjVuLibre tools in the path. Beyond image compression, they support the OCR layer (copy/paste of text) and migration of bookmarks and page labels from DjVu to the PDF info.

[1] https://github.com/jesuslop/djvu2pdf-test
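For readers who want a feel for the pipeline described above (render DjVu pages as bitonal images, then encode them with jbig2enc), here is a minimal sketch. It is not taken from the linked scripts. The function name `build_commands`, the `pages` working directory, and the file naming are hypothetical, and the sketch assumes the `ddjvu` (DjVuLibre) and `jbig2` (jbig2enc) command-line tools with the options as I understand them from their documentation. It only builds the command lines and runs nothing.

```python
from pathlib import Path
from typing import List


def build_commands(djvu: str, page_count: int, workdir: str = "pages") -> List[List[str]]:
    """Build (but do not run) command lines for a DjVu -> PBM -> JBIG2 pass.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    work = Path(workdir)
    cmds: List[List[str]] = []
    pages: List[str] = []
    for n in range(1, page_count + 1):
        page = (work / f"page-{n:04d}.pbm").as_posix()
        pages.append(page)
        # ddjvu (DjVuLibre): render a single page as a bitonal PBM image.
        cmds.append(["ddjvu", "-format=pbm", f"-page={n}", djvu, page])
    # jbig2 (jbig2enc): -s symbol coding, -p PDF-ready output, -b output basename.
    cmds.append(["jbig2", "-s", "-p", "-b", (work / "out").as_posix(), *pages])
    return cmds


if __name__ == "__main__":
    for cmd in build_commands("book.djvu", page_count=2):
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` once the tools are on the path. Assembling the final PDF from the jbig2 output, and carrying over the OCR layer and bookmarks, are separate steps not shown here; the jbig2enc README describes a helper script for the former.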

stared • today at 3:43 PM

Oh, my favourite format during my undergraduate time! Most books in mathematics and physics (some old and niche) were available in the "Russian library".

At the same time, I haven't yet seen DjVu used in a legit way.


qdotme • today at 2:28 PM

Another reason why I think it failed (TIL Yann LeCun was a coauthor) is its association with the pirated books/articles community.

When I came across this format in college days, when handling lots of scanned material, it always triggered the mental “don’t install suspicious software” block. Which is a shame as the article points out it was the superior format.

joecool1029 • today at 5:26 PM

Really hate that archive abandoned it. djvu files are much smaller, faster, and higher quality than pdf. Real reason for abandoning it was probably to allow for the DRM needed for controlled access lending, because it’s a garbage choice otherwise.

nico_h • today at 1:44 PM

I don’t know how relevant the samples are, but while the details are lost, the essence seems well preserved. It seems it would be really useful for performing OCR on.

qingcharles • today at 5:11 PM

Ironically, because of poor software support and lack of knowledge about the format, most DjVus are slowly being converted to PDFs.


vee-kay • today at 7:07 PM

DjVu is an excellent format for e-Comics and e-Magazines.

Check out Amazing Science Fiction Stories, Amazing Stories, Planet Stories, Weird Tales and more, all in DjVu format: https://commons.wikimedia.org/wiki/Category:Scanned_English_...

