Hacker News

aynyc · 05/05/2025

I was really surprised duckdb choked on 500GB. That's maybe a week's worth of data.

The partitioning of parquet files might be an issue, as not all data is neatly partitioned by date. We have trades with different execution dates, clearance dates, and other date values that we need to query on.


Replies

wenc · 05/05/2025

It doesn’t usually choke on 500 GB of data. I query 600 GB of parquet (equivalent to a few TB of CSV?) daily. It’s not the size of the data; it’s the format it’s stored in.
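
Roughly the shape of query I mean, as a Python sketch; the paths and column names (trades/, trade_date, notional) are made up for illustration:

    import duckdb

    # DuckDB only reads the columns and row groups the query touches,
    # so scanning a few hundred GB of parquet stays manageable.
    rel = duckdb.sql("""
        SELECT trade_date,
               count(*)      AS n_trades,
               sum(notional) AS total_notional
        FROM read_parquet('trades/**/*.parquet')
        WHERE trade_date >= DATE '2025-04-01'
        GROUP BY trade_date
        ORDER BY trade_date
    """)
    rel.show()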

If date partitioning doesn’t work, just find another chunking key. The key is to get it into parquet format. CSV is just hugely inefficient.
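A minimal sketch of that one-time conversion, assuming the raw files sit in raw/ and have an execution_date column that parses as a date (both assumptions here); the chunking key is just a derived month:

    import duckdb

    # Read the raw CSVs once and write parquet, hive-partitioned by a
    # month chunking key derived from execution_date (hypothetical column).
    duckdb.sql("""
        COPY (
            SELECT *,
                   strftime(execution_date, '%Y-%m') AS exec_month
            FROM read_csv_auto('raw/trades_*.csv')
        )
        TO 'trades_parquet'
        (FORMAT PARQUET, PARTITION_BY (exec_month), COMPRESSION 'zstd')
    """)

Even for the date columns you don't partition by (clearance date, etc.), parquet's per-row-group min/max stats let DuckDB skip chunks a filter excludes, which CSV can't do at all.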

Or spin up a larger compute instance with more memory. I have 256 GB on mine.

I once tried running an Apache Spark job on a distributed 8-machine cluster over a data lake of 300 GB of TSVs. There was one join in it. It timed out after 8 hours. I realized why: Spark had to do repeated full table scans of the TSVs, which was hugely inefficient. CSV formats are OK for straight-up reads, but any time you have to do analytics operations like aggregates or joins at scale, you're in for a world of pain.

DuckDB has better CSV handling than Spark, but a large dataset in a poor format will stymie any engine.
