How big are the data sets? I've been trying to get DuckDB adopted at our company for financial transaction and reporting data. The dataset is around 500GB of CSV in S3, and DuckDB chokes on it.
CSV is a poor format to access from S3.
You should convert the files to Parquet; then access and analytics become cheap and fast.
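Roughly something like this with DuckDB's Python client (untested sketch; the bucket, paths, region, and the booking_date column are placeholders for whatever your schema actually has, and partitioned Parquet writes to S3 need the httpfs extension and a reasonably recent DuckDB):

    # One pass over the raw CSVs, written back out as Hive-partitioned Parquet.
    # All names below are made up -- adapt to your bucket and columns.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")  # plus s3_access_key_id / s3_secret_access_key if needed

    con.execute("""
        COPY (
            SELECT *,
                   year(booking_date)  AS year,   -- hypothetical date column used for partitioning
                   month(booking_date) AS month
            FROM read_csv_auto('s3://my-bucket/raw/transactions/*.csv')
        )
        TO 's3://my-bucket/parquet/transactions'
        (FORMAT PARQUET, PARTITION_BY (year, month))
    """)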
Are you querying from an EC2 instance close to the S3 data? Are the CSVs partitioned into separate files? Does the machine have 500GB of memory? It’s not always DuckDB’s fault when there’s a clear I/O bottleneck…
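For what it's worth, here's the kind of setup I'd check before blaming the engine (sketch only, sizes and paths are made up):

    # DuckDB doesn't need 500GB of RAM: give it a memory cap and a scratch
    # directory on fast local disk and it will spill for most
    # larger-than-memory operations.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")

    con.execute("SET memory_limit = '48GB'")
    con.execute("SET temp_directory = '/mnt/nvme/duckdb_tmp'")

    # Many smaller CSVs behind a glob are easier to scan in parallel than one
    # giant (especially gzipped) file.
    rows = con.execute("""
        SELECT count(*) FROM read_csv_auto('s3://my-bucket/raw/transactions/*.csv')
    """).fetchone()
    print(rows)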
Could you test with clickhouse-local? It always works better for me.
CSV is a pretty bad format; any engine will choke on it. It basically requires a full table scan to get at any data.
You need to convert it into Parquet or some other columnar format that lets engines do predicate pushdown and fast scans. Each Parquet file stores statistics about the data it contains, so engines can quickly decide whether it’s worth reading the file or skipping it altogether.
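Toy illustration of what that buys you once the data is Parquet (column and path names are invented):

    # The WHERE filter and the two referenced columns get pushed into the scan,
    # so row groups whose min/max statistics can't match are skipped instead of
    # being downloaded from S3.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")

    query = """
        SELECT account_id, sum(amount) AS total
        FROM read_parquet('s3://my-bucket/parquet/transactions/*/*/*.parquet',
                          hive_partitioning = true)
        WHERE year = 2024 AND month = 3
        GROUP BY account_id
    """
    print(con.execute(query).fetchall())
    # Prefixing the query with EXPLAIN shows the filters pushed into the Parquet scan.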