How big are the data sets? I've been trying to get DuckDB adopted at our company for financial transaction and reporting data. The dataset is around 500GB of CSV in S3, and DuckDB chokes on it.
CSV is a poor format to access from S3.
You should convert the files to Parquet; then access and analytics become cheap and fast.
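Roughly something like this with DuckDB's Python client (untested sketch; the bucket, paths, region, and the booking_date column are placeholders for whatever your schema actually has, and partitioned Parquet writes to S3 need the httpfs extension and a reasonably recent DuckDB):

    # One pass over the raw CSVs, written back out as Hive-partitioned Parquet.
    # All names below are made up -- adapt to your bucket and columns.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")  # plus s3_access_key_id / s3_secret_access_key if needed

    con.execute("""
        COPY (
            SELECT *,
                   year(booking_date)  AS year,   -- hypothetical date column used for partitioning
                   month(booking_date) AS month
            FROM read_csv_auto('s3://my-bucket/raw/transactions/*.csv')
        )
        TO 's3://my-bucket/parquet/transactions'
        (FORMAT PARQUET, PARTITION_BY (year, month))
    """)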
Are you querying from an EC2 instance close to the S3 data? Are the CSVs partitioned into separate files? Does the machine have 500GB of memory? It’s not always DuckDB’s fault when there’s a clear I/O bottleneck…
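For what it's worth, here's the kind of setup I'd check before blaming the engine (sketch only, sizes and paths are made up):

    # DuckDB doesn't need 500GB of RAM: give it a memory cap and a scratch
    # directory on fast local disk and it will spill for most
    # larger-than-memory operations.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")

    con.execute("SET memory_limit = '48GB'")
    con.execute("SET temp_directory = '/mnt/nvme/duckdb_tmp'")

    # Many smaller CSVs behind a glob are easier to scan in parallel than one
    # giant (especially gzipped) file.
    rows = con.execute("""
        SELECT count(*) FROM read_csv_auto('s3://my-bucket/raw/transactions/*.csv')
    """).fetchone()
    print(rows)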
Could you test with clickhouse-local? It always works better for me.
CSV is a pretty bad format; any engine will choke on it. It basically requires a full table scan to get at any data.
You need to convert it into Parquet or some other columnar format that lets engines do predicate pushdown and fast scans. Each Parquet file stores statistics about the data it contains, so engines can quickly decide whether it’s worth reading the file or skipping it altogether.
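Toy illustration of what that buys you once the data is Parquet (column and path names are invented):

    # The WHERE filter and the two referenced columns get pushed into the scan,
    # so row groups whose min/max statistics can't match are skipped instead of
    # being downloaded from S3.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")

    query = """
        SELECT account_id, sum(amount) AS total
        FROM read_parquet('s3://my-bucket/parquet/transactions/*/*/*.parquet',
                          hive_partitioning = true)
        WHERE year = 2024 AND month = 3
        GROUP BY account_id
    """
    print(con.execute(query).fetchall())
    # Prefixing the query with EXPLAIN shows the filters pushed into the Parquet scan.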