
aynyc · 05/05/2025

Wait, so I create a new file for every message?


Replies

wenc · 05/05/2025

Typically, small data is batched. You theoretically could create 1 file per row, but I wouldn't (there would be too many files and your filesystem would struggle). But maybe you can batch 1 day's worth of data (or whatever partitioning works for your data) and write it to 1 Parquet file?

For example, my data is usually batched by yearwk (year + week number), so my directory structure looks like this:

  /data/yearwk=202501/000.parquet
  /data/yearwk=202502/000.parquet

This is also called the Hive directory structure. When I query, I just do:

  select * from '/data/**/*.parquet';
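
If you keep the yearwk=... directory naming, DuckDB can also read the directory name back as a column and skip partitions you don't need. Rough sketch, assuming the layout above (the yearwk value is just an example):

  -- hive_partitioning exposes the yearwk= directory name as a column,
  -- so the filter below only reads files from the matching partition
  select * from read_parquet('/data/**/*.parquet', hive_partitioning = true)
  where yearwk = 202501;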

This is a paradigm shift from standard database thinking for handling truly big data. It's append-only by file.
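
Concretely, ingesting a new batch just means dropping another file into its partition directory; nothing gets rewritten in place. A minimal sketch, where new_rows is a made-up staging table and the path is just an example:

  -- new_rows is a hypothetical staging table holding one batch;
  -- the batch lands as a brand-new file, existing files are never touched
  COPY (SELECT * FROM new_rows)
    TO '/data/yearwk=202519/001.parquet' (FORMAT PARQUET);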

500 GB in CSVs doesn't sound that big, though. I'm guessing when you convert to Parquet (a one-liner in DuckDB, below) it might end up being 50 GB or so.

  COPY (FROM '/data/*.csv') TO 'my.parquet' (FORMAT PARQUET);
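
And if you'd rather land the conversion directly into the Hive layout instead of one big file, DuckDB's COPY can partition the output for you. Sketch only, assuming the data has (or you derive) a yearwk column, and '/parquet_root' is a placeholder output path:

  -- writes /parquet_root/yearwk=.../*.parquet, one subdirectory per yearwk value
  COPY (FROM '/data/*.csv')
    TO '/parquet_root' (FORMAT PARQUET, PARTITION_BY (yearwk));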