My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?
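One variable worth isolating is the codec: ClickHouse defaults to LZ4 (often with per-column codecs layered on top), while many Parquet writers default to Snappy, and ZSTD usually closes much of the gap. A minimal sketch with pyarrow, assuming a hypothetical hn_items.parquet export, to rewrite the same data under different codecs and compare:

    import os
    import pyarrow.parquet as pq

    # Hypothetical file name for an export of the items table.
    table = pq.read_table("hn_items.parquet")

    for codec in ("snappy", "zstd"):
        out = f"hn_items.{codec}.parquet"
        # Rewrite the same data, changing only the compression codec.
        pq.write_table(table, out, compression=codec)
        print(codec, round(os.path.getsize(out) / 1e9, 2), "GB")

Row group size and sort order within files also matter, but the codec is the easiest factor to rule out first.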
The best source for this data used to be ClickHouse (https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...), but it hasn't been updated since 2025-12-26.
What happened between 2023 and 2024 to cause the usage dropoff?
Static web content and dynamic data?
> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.
That’s more than 5 minutes ago, by a day or two. No big deal, but it's a little bit depressing that this is still how we do things in 2026.
Please upload to https://academictorrents.com/ as well if possible
Under the Known Limitations section:

> deleted and dead are integers. They are stored as 0/1 rather than booleans.

Is there a technical reason to do this? You have the type right there.

Is it possible to only download a subset? e.g. the Show HNs or the HN Whoishiring threads. Both are very useful for classroom data science, i.e. a good dataset for students to learn the basics of data cleaning and engineering.
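Even without a server-side subset, Parquet makes this cheap to do client-side with predicate pushdown, as long as you know the column names. A sketch with pyarrow, assuming the files follow the HN API schema (id, type, title, ...) and sit in a hypothetical local directory:

    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    # Hypothetical directory holding the archive's Parquet files.
    dataset = ds.dataset("hn_items/", format="parquet")

    # Only read stories, and only the columns we care about.
    stories = dataset.to_table(
        columns=["id", "time", "title", "score", "url"],
        filter=ds.field("type") == "story",
    )

    # Keep just the Show HN posts (titles can be null on deleted items).
    mask = pc.starts_with(pc.fill_null(stories["title"], ""), "Show HN")
    show_hn = stories.filter(mask)

The filter on type is pushed down so only matching row groups are read; the title match happens in memory afterwards.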
Is the legal page a placeholder, or do words have no meaning?
https://www.ycombinator.com/legal/
Mods, enforce your license terms; you're playing fast and loose with the law (GDPR/CPRA).
what’s the license? “do whatever the fuck you want with the data as long as you don’t get caught”? or does that only work for massive corporations
> At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.
Wouldn't that lose deleted/moderated comments?
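Whether it does is easy to verify once you have both the day's 5-minute blocks and the monthly refetch: any id present in the blocks but missing from the authoritative file was deleted or moderated away in between. A sketch, assuming hypothetical local file names and an id column per the HN API schema:

    import pyarrow.parquet as pq

    # Hypothetical file names: the day's merged 5-minute blocks
    # and the refetched authoritative month file.
    block_ids = set(pq.read_table("today_blocks.parquet", columns=["id"])["id"].to_pylist())
    month_ids = set(pq.read_table("2026-03.parquet", columns=["id"])["id"].to_pylist())

    # Items captured live but absent from the refetch.
    lost = block_ids - month_ids
    print(len(lost), "items in the 5-minute blocks are gone after the refetch")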
Good for a demo, but every 5 minutes? Why?