Typically, small data is batched. While you theoretically could, I wouldn't create one file per row (there would be too many files and your filesystem would struggle). But maybe you can batch one day's worth of data (or whatever partitioning works for your data) and write it to one Parquet file?
For example, my data is usually batched by yearwk (year + week no), so my directory structure looks like this:
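(Sketched with illustrative names; the real folders just need to follow the same `yearwk=...` key=value pattern that Hive-style tools expect.)

```
data/
├── yearwk=2024_01/
│   └── part-0.parquet
├── yearwk=2024_02/
│   └── part-0.parquet
└── yearwk=2024_03/
    └── part-0.parquet
```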
This is also called the Hive directory structure. When I query, I just point DuckDB at the partitioned path and filter on the partition column (see the sketch below). This is a paradigm shift from standard database thinking for handling truly big data: it's append-only by file.

500GB in CSVs doesn't sound that big, though. I'm guessing when you convert it to Parquet (a 1-liner in DuckDB, also sketched below) it might end up being 50GB or so.
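Here's a rough sketch of both the conversion and a query, assuming the yearwk layout above; the file paths and column names are made up, so adjust them to your data:

```sql
-- Convert one batch of CSV into one Parquet file in its partition folder
-- (illustrative paths; the target directory must already exist)
COPY (SELECT * FROM read_csv_auto('raw/2024_08.csv'))
    TO 'data/yearwk=2024_08/part-0.parquet' (FORMAT PARQUET);

-- Query across every partition; hive_partitioning exposes the yearwk
-- directory name as a column, and the WHERE clause means DuckDB only
-- reads the files under the matching directory
SELECT count(*)
FROM read_parquet('data/*/*.parquet', hive_partitioning = true)
WHERE yearwk = '2024_08';
```

If your CSVs already contain a yearwk column, the `PARTITION_BY` option on DuckDB's `COPY ... TO` can write the whole Hive layout in one statement instead of one file at a time.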