Did Athena on CSV work for you? I've used Athena and it struggles with CSV at scale too.
Btw, I'm not suggesting you use Spark. I'm saying that even Spark didn't work for me on large TSV datasets (it only takes a JOIN or GROUP BY to kill query performance). CSV/TSV is simply the wrong storage format for analytics.
Partitioning is effectively irreversible, but coming up with a thoughtful scheme isn't that hard. You just need to hash something; even something as simple as an FNV hash on some meaningful field is sufficient. In one of my datasets I partition by week, then by FNV hash modulo 50 chunks, so it looks like this:
/yearwk=202501/chunk=24/000.parquet
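For concreteness, here's a minimal sketch of that key derivation in Python. The "user_id" field, the date column, and the chunk count of 50 are just stand-ins for whatever is meaningful in your data; the hash is plain FNV-1a:

    from datetime import date

    def fnv1a_64(s: str) -> int:
        """64-bit FNV-1a hash of a string."""
        h = 0xCBF29CE484222325
        for b in s.encode("utf-8"):
            h ^= b
            h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
        return h

    def partition_prefix(event_date: date, user_id: str, n_chunks: int = 50) -> str:
        """Hive-style partition prefix for one record: /yearwk=YYYYWW/chunk=NN/"""
        iso = event_date.isocalendar()
        yearwk = iso.year * 100 + iso.week      # e.g. 202501
        chunk = fnv1a_64(user_id) % n_chunks    # e.g. 24
        return f"/yearwk={yearwk}/chunk={chunk}/"

    print(partition_prefix(date(2025, 1, 2), "someuser"))  # /yearwk=202501/chunk=.../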
Ask an LLM to suggest a partitioning scheme, or think of one yourself.
CSV is the mistake; the move here is to get out of CSV. Partitioning is secondary: here it's only used to chunk the Parquet, nothing else. You are not locked into anything.
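If it helps, here's a rough sketch of the CSV-to-Parquet move. DuckDB is my choice here, not something prescribed above, and the event_time/user_id columns and paths are hypothetical; it leans on DuckDB's yearweek() and hash() built-ins (not FNV, but any stable hash works for chunking):

    import duckdb

    # One-shot conversion: read the CSV, derive the two partition keys,
    # and write a Hive-partitioned Parquet dataset.
    duckdb.execute("""
        COPY (
            SELECT *,
                   yearweek(event_time) AS yearwk,   -- e.g. 202501
                   hash(user_id) % 50   AS chunk
            FROM read_csv_auto('events.csv')
        ) TO 'dataset' (FORMAT PARQUET, PARTITION_BY (yearwk, chunk))
    """)

From there, any of the engines mentioned above can prune on yearwk/chunk instead of scanning the whole dataset, and you can always rewrite with a different scheme later.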