
aynyc · 05/05/2025

Wait, so I create a new file for every message?


Replies

wenc · 05/05/2025

Typically, small data is batched. You theoretically could create 1 file per row, but I wouldn't (there would be too many files and your filesystem would struggle). But maybe you can batch 1 day's worth of data (or whatever partitioning works for your data) and write it to 1 Parquet file?

For example, my data is usually batched by yearwk (year + week number), so my directory structure looks like this:

  /data/yearwk=202501/000.parquet
  /data/yearwk=202502/000.parquet

This is also called the Hive directory structure. When I query, I just do:

  select * from '/data/**/*.parquet';
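
If you keep the yearwk=... directory naming, DuckDB can also read the directory name back as a column and skip partitions you don't need. Rough sketch, assuming the layout above (the yearwk value is just an example):

  -- hive_partitioning exposes the yearwk= directory name as a column,
  -- so the filter below only reads files from the matching partition
  select * from read_parquet('/data/**/*.parquet', hive_partitioning = true)
  where yearwk = 202501;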

This is a paradigm shift from standard database thinking for handling truly big data. It's append-only by file.
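
Concretely, ingesting a new batch just means dropping another file into its partition directory; nothing gets rewritten in place. A minimal sketch, where new_rows is a made-up staging table and the path is just an example:

  -- new_rows is a hypothetical staging table holding one batch;
  -- the batch lands as a brand-new file, existing files are never touched
  COPY (SELECT * FROM new_rows)
    TO '/data/yearwk=202519/001.parquet' (FORMAT PARQUET);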

500 GB in CSVs doesn't sound that big, though. I'm guessing when you convert to Parquet (a one-liner in DuckDB, below) it might end up being 50 GB or so.

  COPY (FROM '/data/*.csv') TO 'my.parquet' (FORMAT PARQUET);
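
And if you'd rather land the conversion directly into the Hive layout instead of one big file, DuckDB's COPY can partition the output for you. Sketch only, assuming the data has (or you derive) a yearwk column, and '/parquet_root' is a placeholder output path:

  -- writes /parquet_root/yearwk=.../*.parquet, one subdirectory per yearwk value
  COPY (FROM '/data/*.csv')
    TO '/parquet_root' (FORMAT PARQUET, PARTITION_BY (yearwk));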