Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m

101 points by tamnd last Saturday at 5:12 PM | 36 comments

Comments

kshacker today at 7:04 PM

Good for demo but every 5 minutes? Why?

gkbrk today at 5:51 PM

My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?

xnx today at 5:56 PM

The best source for this data used to be ClickHouse (https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...), but it hasn't been updated since 2025-12-26.

alstonite today at 6:52 PM

What happened between 2023 and 2024 to cause the usage dropoff?

mlhpdx today at 6:15 PM

Static web content and dynamic data?

> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.

That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.

tonymet today at 6:58 PM

what's the license for HN content?

lokimoon today at 6:59 PM

You are the product

lyu07282 today at 6:56 PM

Please upload to https://academictorrents.com/ as well if possible

0cf8612b2e1e today at 5:50 PM

Under the Known Limitations section

> deleted and dead are integers. They are stored as 0/1 rather than booleans.

Is there a technical reason to do this? You have the type right there.
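
For readers who just want real booleans on their side, the conversion is cheap after loading. A minimal sketch, assuming rows have already been read into plain Python dicts; only the `deleted` and `dead` field names come from the quoted limitation, while the sample rows, the `id` field, and the helper `with_bool_flags` are made up for illustration:

```python
# Hypothetical sample rows; only `deleted` and `dead` are names taken
# from the Known Limitations quote. The values are invented.
rows = [
    {"id": 1, "deleted": 0, "dead": 0},
    {"id": 2, "deleted": 1, "dead": 0},
    {"id": 3, "deleted": 0, "dead": 1},
]


def with_bool_flags(row):
    """Return a copy of `row` with the 0/1 flags converted to booleans."""
    fixed = dict(row)
    for flag in ("deleted", "dead"):
        fixed[flag] = bool(row[flag])
    return fixed


cleaned = [with_bool_flags(r) for r in rows]

# Keep only items that are neither deleted nor dead.
live = [r["id"] for r in cleaned if not r["deleted"] and not r["dead"]]
print(live)
```

This does not answer why the archive stores integers in the first place; it only shows that the limitation is easy to work around downstream.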

Onavo today at 5:22 PM

Is it possible to download only a subset, e.g. the Show HN or Who is hiring threads? Those are very useful for classroom data science: a good dataset for students to learn the basics of data cleaning and engineering.
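
Whether the archive supports partial downloads is not stated in this thread, but once the data is loaded, carving out such a subset is a simple filter. A rough sketch, assuming each item carries `type` and `title` fields as in the public HN API (the archive's actual column names may differ); the sample rows and the helper `select_by_prefix` are hypothetical:

```python
# Hypothetical sample items; field names follow the public HN API and
# are an assumption about the archive's schema. Titles are invented.
rows = [
    {"id": 1, "type": "story", "title": "Show HN: A tiny Parquet viewer"},
    {"id": 2, "type": "story", "title": "Ask HN: Who is hiring? (March 2026)"},
    {"id": 3, "type": "comment", "title": None},
    {"id": 4, "type": "story", "title": "Why columnar formats compress well"},
]


def select_by_prefix(items, prefix):
    """Return the stories whose title starts with `prefix`."""
    return [
        item
        for item in items
        if item["type"] == "story" and (item["title"] or "").startswith(prefix)
    ]


show_hn = select_by_prefix(rows, "Show HN:")
hiring = select_by_prefix(rows, "Ask HN: Who is hiring?")
print([r["id"] for r in show_hn], [r["id"] for r in hiring])
```

Comments have no title, so the `or ""` guard keeps the filter from failing on them.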

GeoAtreides today at 5:42 PM

is the legal page a placeholder, do words have no meaning?

https://www.ycombinator.com/legal/

Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)

bstsb today at 5:27 PM

what’s the license? “do whatever the fuck you want with the data as long as you don’t get caught”? or does that only work for massive corporations

palmotea today at 5:34 PM

> At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.

Wouldn't that lose deleted/moderated comments?
