Recreating Epstein PDFs from raw encoded attachments

500 points • by ComputerGuru • last Wednesday at 7:19 PM • 184 comments

Comments

dperfect • today at 1:26 AM

Nerdsnipe confirmed :)

Claude Opus came up with this script:

https://pastebin.com/ntE50PkZ

It produces a somewhat-readable PDF (first page at least) with this text output:

https://pastebin.com/SADsJZHd

(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)

bawolff • yesterday at 11:52 PM

Tesseract supports being trained for specific fonts; that would probably be a good starting point.

https://pretius.com/blog/ocr-tesseract-training-data

tcgv • today at 4:42 PM

> Then my mom wrote the following: “be careful not to get sucked up in the slime-machine going on here! Since you don’t care that much about money, they can’t buy you at least.”

I'm lucky to have parents with strong values. My whole life they've given me advice, on the small stuff and the big decisions. I didn't always want to hear it when I was younger, but now in my late thirties, I'm really glad they kept sharing it. In hindsight I can see the life-experience / wisdom in it, and how it's helped and shaped me.

pyrolistical • yesterday at 11:24 PM

It decodes to binary pdf and there are only so many valid encodings. So this is how I would solve it.

1. Get an open source pdf decoder

2. Decode bytes up to first ambiguous char

3. See if the next bits are valid with a 1; if not, it’s an l

4. Might need to backtrack if both 1 and l were valid

By being able to quickly try each char in the middle of the decoding process, you cut out the startup time. This makes it feasible to test all permutations automatically and linearly.
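
The steps above can be sketched roughly as follows. This is only an illustration: `resolve`, `AMBIGUOUS`, and `is_valid_prefix` are hypothetical names, and the validity check is left as a callback where a real PDF decoder would go. It also assumes the only confusable pair is 1 and l.

```python
from typing import Callable, Optional

# Characters the OCR step cannot tell apart (assumption: only 1 and l).
AMBIGUOUS = {"1": "1l", "l": "1l"}


def resolve(text: str, is_valid_prefix: Callable[[str], bool]) -> Optional[str]:
    """Pick a 1/l reading for every ambiguous character, backtracking on dead ends.

    `is_valid_prefix` stands in for "the decoder still accepts what it has
    seen so far". Returns None if no combination is accepted.
    """
    options = [AMBIGUOUS.get(ch, ch) for ch in text]  # candidates per position
    next_try = [0] * len(text)  # index of the next candidate to try per position
    out = []                    # characters chosen so far
    i = 0
    while i < len(text):
        if next_try[i] < len(options[i]):
            out.append(options[i][next_try[i]])
            next_try[i] += 1
            if is_valid_prefix("".join(out)):
                i += 1                      # accepted: move to the next position
                if i < len(text):
                    next_try[i] = 0
            else:
                out.pop()                   # rejected: try the other reading
        elif i == 0:
            return None                     # every reading was rejected
        else:
            i -= 1                          # backtrack to the previous position
            out.pop()
    return "".join(out)
```

Re-joining the prefix on every step is wasteful; a real version would keep the decoder's state at each position so that trying the other character costs only a few bytes of work, which is the point about cutting out the startup time.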

percentcer • yesterday at 11:25 PM

This is one of those things that seems like a nerd snipe but would be more easily accomplished through brute forcing it. Just get 76 people to manually type out one page each, you'd be done before the blog post was written.

legitster • today at 12:30 AM

Given how much of a hot mess PDFs are in general, it seems like it would behoove the government to just develop a new, actually safe format to standardize around for government releases and make it open source.

Unlike every other PDF format that has been attempted, the federal government doesn't have to worry about adoption.

ChocMontePy • today at 1:48 AM

You can use the justice.gov search box to find several different copies of that same email.

The copy linked in the post:

https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...

Three more copies:

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

Perhaps having several different versions might make it easier.

sorbus-25 • today at 4:17 AM

Event details: https://web.archive.org/web/20260206040716/https://what2wear...

kevin_thibedeau • yesterday at 11:53 PM

pdftoppm and Ghostscript (invoked via ImageMagick) re-rasterize full pages to generate their output. That's why it was slow. It's even worse with a Q16 build of ImageMagick. Better to extract the scanned page images directly with pdfimages or mutool.

Followup: pdfimages is 13x faster than pdftoppm
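
For anyone scripting this, a small sketch of the direct-extraction route using Poppler's pdfimages from Python. The helper names and file names are made up for illustration; `-png`, `-f`, and `-l` are pdfimages options for output format and first/last page.

```python
import subprocess
from typing import List, Optional


def pdfimages_cmd(pdf_path: str, out_prefix: str,
                  first: Optional[int] = None,
                  last: Optional[int] = None) -> List[str]:
    """Build a pdfimages command that dumps embedded images as PNG files."""
    cmd = ["pdfimages", "-png"]
    if first is not None:
        cmd += ["-f", str(first)]   # first page to extract from
    if last is not None:
        cmd += ["-l", str(last)]    # last page to extract from
    cmd += [pdf_path, out_prefix]   # output files are named <out_prefix>-NNN.png
    return cmd


def extract_images(pdf_path: str, out_prefix: str,
                   first: Optional[int] = None,
                   last: Optional[int] = None) -> None:
    """Run pdfimages; requires Poppler's command-line tools to be installed."""
    subprocess.run(pdfimages_cmd(pdf_path, out_prefix, first, last), check=True)
```

Something like `extract_images("scan.pdf", "page", first=1, last=3)` would then write the embedded scans for the first three pages without re-rendering them.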

pimlottc • yesterday at 11:08 PM

Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…

Hmm. Anyone got some spare CPU time?
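
To put a number on that estimate, a quick check of the arithmetic, using the same assumption of one ambiguous character per line:

```python
pages = 76
lines_per_page = 69

# One ambiguous [1l] per line means one binary choice per line.
ambiguous = pages * lines_per_page          # 5244
combinations = 2 ** ambiguous

# Number of decimal digits in 2^5244.
digits = len(str(combinations))             # 1579
print(f"2^{ambiguous} has {digits} decimal digits")
```

So the count of blind permutations is a number with well over a thousand digits, which is why any workable approach has to reject wrong choices early instead of enumerating them.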

bushbaba • today at 2:29 AM

This proves my paranoia that you should print and rescan redactions. That, or take screenshots of the redacted PDF and convert them back to a PDF.

chrisjj • yesterday at 11:15 PM

> it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this

Or worse. She did.

velaia • today at 12:13 AM

Bummer that it's not December - the https://www.reddit.com/r/adventofcode/ crowd would love this puzzle

nubg • today at 12:50 AM

Wait would this give us the unredacted PDFs?

ks2048 • today at 6:22 AM

I wonder if jmail (https://www.jmail.world/) has worked on this?

I tried to find the message in this blog post, but couldn't. (don't see how to search by date).

linuxguy2 • yesterday at 10:47 PM

Love this, absolutely looking forward to some results.

FarmerPotato • yesterday at 11:01 PM

If only Base64 had used a checksum.
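
A tiny illustration of the problem, with a made-up four-character group: swapping l for 1 still yields perfectly valid Base64, so the decoder has nothing to complain about. The CRC-32 at the end is only there to show what a per-chunk checksum would have caught; it is not part of Base64.

```python
import base64
import zlib

good = "abcl"   # hypothetical original group
bad = "abc1"    # same group after an OCR mix-up of l and 1

# Both strings use only Base64 alphabet characters, so both decode cleanly.
decoded_good = base64.b64decode(good, validate=True)
decoded_bad = base64.b64decode(bad, validate=True)

print(decoded_good != decoded_bad)                          # the bytes differ silently
print(zlib.crc32(decoded_good) != zlib.crc32(decoded_bad))  # a checksum would notice
```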

Evidlo • today at 2:47 AM

I took a stab at training Tesseract and holy jeebus is their CLI awful. Just an insanely complicated configuration procedure.

wtcactus • today at 7:38 AM

My non-political take about this gift that keeps on giving is that PDF might seem great for the end user who is just expected to read or print the file they are given, but the technology actually sucks.

PDF is basically a prettify layer on top of the older PS that brings a whole lot of baggage. The moment you start trying to do what should be simple stuff like editing lines, merging pages, or changing the resolution of the images, it starts giving you a lot of headaches.

I used to have a few scripts around to fight some of its quirks from when I was writing my thesis and had to work daily with it. But well, it was still an improvement over Word.

zahlman • yesterday at 11:26 PM

> …but good luck getting that to work once you get to the flate-compressed sections of the PDF.

A dynamic programming type approach might still be helpful. One version or other of the character might produce invalid flate data while the other is valid, or might give an implausible result.
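
A minimal sketch of the "invalid flate data" test, assuming the candidate bytes have already been Base64-decoded and cut at the start of a FlateDecode stream (zlib format). `flate_prefix_ok` and `surviving_candidates` are hypothetical helper names.

```python
import zlib
from typing import Iterable, List


def flate_prefix_ok(data: bytes) -> bool:
    """Return True if zlib accepts `data` as the beginning of a stream.

    A truncated stream is fine here; only data that the inflater
    actively rejects counts as invalid.
    """
    inflater = zlib.decompressobj()
    try:
        inflater.decompress(data)
    except zlib.error:
        return False
    return True


def surviving_candidates(candidates: Iterable[bytes]) -> List[bytes]:
    """Keep only the candidate byte strings that still inflate without error."""
    return [c for c in candidates if flate_prefix_ok(c)]
```

Passing this check does not prove a candidate is right: a wrong character can inflate without complaint for a while and only fail further along, or only at the final checksum. That is where the "implausible result" test on the inflated bytes would come in.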

iwontberude • yesterday at 11:01 PM

This one is irresistible to play with. Indeed a nerd snipe.

winddude • today at 4:28 AM

here's another few to decode,

https://www.justice.gov/epstein/files/DataSet%2010/EFTA01804...

https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...

https://www.justice.gov/epstein/files/DataSet%209/EFTA004349...

and then this one judging by the name of the file (hanna something) and content of the email:

"Here is my girl, sweet sparkling Hanna=E2=80=A6! I am sure she is on Skype "

maybe more sinister (so be careful, I have no idea what the laws are if you uncover you know what Trump and Epstein were into)...

https://www.justice.gov/epstein/files/DataSet%2011/EFTA02715...

[Above is probably a legit modeling CV for HANNA BOUVENG, based on, https://www.justice.gov/epstein/files/DataSet%209/EFTA011204..., but still creepy, and doesn't seem like there's evidence of her being a victim]

queenkjuul • today at 3:03 AM

I'm only here to shout out fish shell, a shell finally designed for the modern world of the 90s

eek2121 • yesterday at 11:41 PM

Honestly, this is something that should've been kept private, until each and every single one of the files is out in the open. Sure, mistakes are being made, but if you blast them onto the internet, they WILL eventually get fixed.

Cool article, however.

SomaticPirate • today at 4:18 AM

Are there archives of this? I have no doubt after this post goes viral some of these files might go “missing”. Having a large number of conspiracies validated has led me to firmly plant my aluminum hat.

blindriver • today at 12:15 AM

On one hand, the DOJ gets shit because it was taking too long to produce the documents, and then on another, they get shit because there are mistakes in the redacting because there are 3 million pages of documents.

IshKebab • today at 2:22 PM

Disappointing how terrible open source OCR still is.

prettywoman • yesterday at 11:19 PM

[dead]

heraldgeezer • today at 7:27 AM

[flagged]
