
Recreating Epstein PDFs from raw encoded attachments

500 points by ComputerGuru last Wednesday at 7:19 PM | 184 comments

Comments

dperfect today at 1:26 AM

Nerdsnipe confirmed :)

Claude Opus came up with this script:

https://pastebin.com/ntE50PkZ

It produces a somewhat-readable PDF (first page at least) with this text output:

https://pastebin.com/SADsJZHd

(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)

bawolff yesterday at 11:52 PM

Tesseract supports being trained for specific fonts; that would probably be a good starting point.

https://pretius.com/blog/ocr-tesseract-training-data

tcgv today at 4:42 PM

> Then my mom wrote the following: “be careful not to get sucked up in the slime-machine going on here! Since you don’t care that much about money, they can’t buy you at least.”

I'm lucky to have parents with strong values. My whole life they've given me advice, on the small stuff and the big decisions. I didn't always want to hear it when I was younger, but now in my late thirties, I'm really glad they kept sharing it. In hindsight I can see the life experience and wisdom in it, and how it's helped and shaped me.

pyrolistical yesterday at 11:24 PM

It decodes to binary pdf and there are only so many valid encodings. So this is how I would solve it.

1. Get an open source pdf decoder

2. Decode bytes up to first ambiguous char

3. See if the next bits are valid with a 1; if not, it’s an l

4. Might need to backtrack if both 1 and l were valid

By being able to quickly try each char in the middle of the decoding process you cut out the start time. This makes it feasible to test all permutations automatically and linearly
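The steps above can be sketched roughly as follows. This is a toy, not a PDF decoder: `looks_valid`, `VOCAB`, `decode_whole_groups` and `resolve` are hypothetical names, and the word-list check stands in for whatever a real PDF parser would accept or reject. It also recurses once per ambiguous character, so a real 76-page input would need an iterative version.

```python
import base64

VOCAB = {"the", "blue", "file"}  # toy stand-in for "what valid output looks like"

def looks_valid(data):
    """Toy check (assumption): every space-terminated word must be a known word."""
    text = data.decode("ascii", errors="replace")
    *finished, _unfinished = text.split(" ")
    return all(word in VOCAB for word in finished)

def decode_whole_groups(chars, end):
    """Decode only the complete 4-character Base64 groups inside chars[:end]."""
    usable = end - end % 4
    return base64.b64decode("".join(chars[:usable]))

def resolve(ocr_text, is_valid=looks_valid):
    """Try 1 and l at each ambiguous spot, backtracking when a prefix is invalid."""
    chars = list(ocr_text)
    spots = [i for i, c in enumerate(chars) if c in "1l"]

    def search(k):
        if k == len(spots):
            ok = is_valid(decode_whole_groups(chars, len(chars)))
            return "".join(chars) if ok else None
        # Everything before the next ambiguous spot is fixed once this one is chosen.
        end = spots[k + 1] if k + 1 < len(spots) else len(chars)
        for option in "1l":
            chars[spots[k]] = option
            if is_valid(decode_whole_groups(chars, end)):
                found = search(k + 1)
                if found is not None:
                    return found
        return None  # both options failed here, so the caller backtracks

    return search(0)

original = base64.b64encode(b"the blue file ").decode("ascii")
garbled = original.replace("l", "1")  # simulate OCR reading every l as 1
print(garbled, "->", resolve(garbled))
```

The pruning only helps if the validity check can reject a wrong choice soon after it is made; with a check that accepts almost anything, the search degrades to trying every combination.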

percentcer yesterday at 11:25 PM

This is one of those things that seems like a nerd snipe but would be more easily accomplished through brute forcing it. Just get 76 people to manually type out one page each, you'd be done before the blog post was written.

legitster today at 12:30 AM

Given how much of a hot mess PDFs are in general, it seems like it would behoove the government to just develop a new, actually safe format to standardize around for government releases and make it open source.

Unlike every other PDF format that has been attempted, the federal government doesn't have to worry about adoption.

ChocMontePy today at 1:48 AM

You can use the justice.gov search box to find several different copies of that same email.

The copy linked in the post:

https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...

Three more copies:

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

Perhaps having several different versions might make it easier.

kevin_thibedeau yesterday at 11:53 PM

pdftoppm and Ghostscript (invoked via ImageMagick) re-rasterize full pages to generate their output. That's why it was slow. It's even worse with a Q16 build of ImageMagick. Better to extract the scanned page images directly with pdfimages or mutool.

Followup: pdfimages is 13x faster than pdftoppm
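A minimal sketch of the direct-extraction route, assuming Poppler's `pdfimages` is installed and on the PATH. The helper names and the `scan.pdf` / `page` arguments in the usage note are hypothetical; `-all` asks pdfimages to write each embedded image in its native format instead of re-rendering the page.

```python
import shutil
import subprocess

def pdfimages_command(pdf_path, out_prefix):
    """Build a Poppler pdfimages call that dumps embedded images as stored."""
    return ["pdfimages", "-all", pdf_path, out_prefix]

def extract_images(pdf_path, out_prefix):
    """Run pdfimages; files are written as <out_prefix>-NNN.<ext>."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (Poppler) not found on PATH")
    subprocess.run(pdfimages_command(pdf_path, out_prefix), check=True)
```

Calling `extract_images("scan.pdf", "page")` would then hand the untouched scans straight to the OCR step.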

pimlottc yesterday at 11:08 PM

Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…

Hmm. Anyone got some spare CPU time?
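For scale, the estimate can be spelled out in a few lines. The page and line counts are the ones given above; the variable names are just for illustration.

```python
pages = 76
lines_per_page = 69
ambiguous_per_line = 1  # the "say there's one instance per line" assumption

binary_choices = pages * lines_per_page * ambiguous_per_line
combinations = 2 ** binary_choices  # Python integers are arbitrary precision

print(binary_choices)          # number of independent 1-vs-l decisions
print(len(str(combinations)))  # how many decimal digits the count has
```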

bushbaba today at 2:29 AM

This proves my paranoia that you should print and rescan redactions. That, or take screenshots of the redacted PDF and convert them back to a PDF.

chrisjj yesterday at 11:15 PM

> it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this

Or worse. She did.

velaia today at 12:13 AM

Bummer that it's not December - the https://www.reddit.com/r/adventofcode/ crowd would love this puzzle

nubg today at 12:50 AM

Wait would this give us the unredacted PDFs?

ks2048 today at 6:22 AM

I wonder if jmail (https://www.jmail.world/) has worked on this?

I tried to find the message in this blog post, but couldn't. (don't see how to search by date).

linuxguy2 yesterday at 10:47 PM

Love this, absolutely looking forward to some results.

FarmerPotato yesterday at 11:01 PM

If only Base64 had used a checksum.
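A small illustration of the point, using CRC-32 purely as a stand-in checksum (that choice is an assumption; Base64 itself defines none). Swapping `l` for `1` yields another perfectly valid Base64 string, so the decoder cannot notice, while a checksum of the intended bytes would.

```python
import base64
import zlib

good = base64.b64encode(b"the").decode("ascii")  # this encoding ends in "l"
misread = good.replace("l", "1")                  # OCR-style confusion

# Both strings are valid Base64, so decoding raises no error either way.
print(base64.b64decode(good), base64.b64decode(misread))

# A checksum of the intended bytes, had one been sent along, would expose the swap.
expected_crc = zlib.crc32(base64.b64decode(good))
print(zlib.crc32(base64.b64decode(misread)) == expected_crc)
```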

Evidlo today at 2:47 AM

I took a stab at training Tesseract and holy jeebus is their CLI awful. Just an insanely complicated configuration procedure.

wtcactus today at 7:38 AM

My non political take about this gift that keeps on giving is that: PDF might seem great for the end user that is just expected to read or print the file they are given, but the technology actually sucks.

PDF is basically a prettifying layer on top of the older PS, and it brings a whole lot of baggage. The moment you start trying to do what should be simple stuff like editing lines, merging pages, or changing the resolution of the images, it starts giving you a lot of headaches.

I used to have a few scripts around to fight some of its quirks from when I was writing my thesis and had to work daily with it. But well, it was still an improvement over Word.

zahlman yesterday at 11:26 PM

> …but good luck getting that to work once you get to the flate-compressed sections of the PDF.

A dynamic programming type approach might still be helpful. One version or other of the character might produce invalid flate data while the other is valid, or might give an implausible result.
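A rough sketch of the validity test, assuming the flate data is zlib-wrapped (as PDF FlateDecode streams normally are) and has already been isolated from the rest of the file. It is plain trial inflation of single-character swaps rather than a dynamic-programming search, and `inflates`, `surviving_swaps` and the sample payload are all made up for illustration.

```python
import base64
import zlib

def inflates(raw):
    """True if raw is a complete zlib stream whose checksum matches."""
    try:
        zlib.decompress(raw)
        return True
    except zlib.error:
        return False

def surviving_swaps(b64_text):
    """Indexes of 1/l characters whose swapped variant still inflates cleanly."""
    survivors = []
    for i, ch in enumerate(b64_text):
        if ch in "1l":
            other = "l" if ch == "1" else "1"
            candidate = b64_text[:i] + other + b64_text[i + 1:]
            if inflates(base64.b64decode(candidate)):
                survivors.append(i)
    return survivors

# Stand-in for a compressed PDF content stream (not real document data).
payload = b"BT /F1 12 Tf (stand-in for a PDF content stream) Tj ET\n" * 20
encoded = base64.b64encode(zlib.compress(payload)).decode("ascii")
print(len(encoded), "Base64 chars; swaps that still inflate:", surviving_swaps(encoded))
```

Because each swap flips a single bit of the compressed data, a wrong guess usually breaks the deflate structure or the trailing Adler-32 check, which is what makes this a useful filter. A swap that happens to land in ignored padding bits could still slip through.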

iwontberude yesterday at 11:01 PM

This one is irresistible to play with. Indeed a nerd snipe.

winddude today at 4:28 AM

Here are a few more to decode:

https://www.justice.gov/epstein/files/DataSet%2010/EFTA01804...

https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...

https://www.justice.gov/epstein/files/DataSet%209/EFTA004349...

and then this one, judging by the name of the file (hanna something) and the content of the email:

"Here is my girl, sweet sparkling Hanna=E2=80=A6! I am sure she is on Skype "

maybe more sinister (so be careful, I have no idea what the laws are if you uncover you know what Trump and Epstein were into)...

https://www.justice.gov/epstein/files/DataSet%2011/EFTA02715...

[Above is probably a legit modeling CV for HANNA BOUVENG, based on, https://www.justice.gov/epstein/files/DataSet%209/EFTA011204..., but still creepy, and doesn't seem like there's evidence of her being a victim]

queenkjuul today at 3:03 AM

I'm only here to shout out fish shell, a shell finally designed for the modern world of the 90s

eek2121 yesterday at 11:41 PM

Honestly, this is something that should've been kept private, until each and every single one of the files is out in the open. Sure, mistakes are being made, but if you blast them onto the internet, they WILL eventually get fixed.

Cool article, however.

SomaticPirate today at 4:18 AM

Are there archives of this? I have no doubt that after this post goes viral some of these files might go “missing”. Having a large number of conspiracies validated has led me to firmly plant my aluminum hat.

blindriver today at 12:15 AM

On one hand, the DOJ gets shit because it was taking too long to produce the documents, and then on another, they get shit because there are mistakes in the redacting because there are 3 million pages of documents.

IshKebab today at 2:22 PM

Disappointing how terrible open source OCR still is.

prettywoman yesterday at 11:19 PM

[dead]

heraldgeezer today at 7:27 AM

[flagged]
