Not a specific link, but a DuckDB table is a bag of rows (logically; physically it's arranged as a column store), and those rows are not ordered in any way that the schema expresses. If you do a big analytic query, DuckDB will (extremely efficiently) scan the whole table and will blow many other tools out of the water while doing so. But if you want the value of one specific sensor at one specific time, you want an index of some sort, not a full table scan. And if you want to do a rollup of some but not all sensors, you end up modifying rows in the middle of the table, which is not amazingly efficient. DuckDB does have an optional index, but I don't think it's meant for this.
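To make the scan-vs-index distinction concrete, here's a toy sketch in plain Python (the table, column names, and values are all invented; this stands in for what a database does, it's not DuckDB's actual machinery):

```python
# Hypothetical rows: (sensor_id, ts, value). Values are made up
# so that sensor s at time t stores s * 100 + t.
readings = [
    {"sensor_id": s, "ts": t, "value": s * 100 + t}
    for s in range(3) for t in range(5)
]

# No index: every point lookup is a full scan, O(n) per query.
# Great when you're aggregating everything anyway, wasteful for
# "one sensor at one time".
def lookup_scan(sensor_id, ts):
    for row in readings:
        if row["sensor_id"] == sensor_id and row["ts"] == ts:
            return row["value"]

# An index: build once, then each point lookup is O(1).
index = {(r["sensor_id"], r["ts"]): r["value"] for r in readings}

lookup_scan(2, 3)  # walks all 15 rows to find 203
index[(2, 3)]      # jumps straight to 203
```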
You could certainly create a directory with one Parquet file per (entity id, time range), and you could probably convince the DuckDB query engine to understand that layout (via DuckLake? raw Hive partitioning can only barely express it), but I don't think DuckDB will binary search over the files for you. (And binary search is actually pretty lousy for this use case anyway.)
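The do-it-yourself version of that lookup, if you did keep such a directory, is a binary search over each entity's file start times. A minimal sketch with the stdlib `bisect` module; the entity name, file paths, and time ranges are all invented:

```python
import bisect

# Hypothetical catalog: per entity, a sorted list of
# (start_ts, parquet_path) covering disjoint time ranges.
files = {
    "sensor_a": [
        (0,   "sensor_a/0-99.parquet"),
        (100, "sensor_a/100-199.parquet"),
        (200, "sensor_a/200-299.parquet"),
    ],
}

def file_for(entity_id, ts):
    ranges = files[entity_id]
    starts = [start for start, _ in ranges]
    # bisect_right - 1 finds the last file whose start <= ts,
    # i.e. the file whose range should contain ts.
    i = bisect.bisect_right(starts, ts) - 1
    return ranges[i][1]

file_for("sensor_a", 150)  # -> "sensor_a/100-199.parquet"
```

This is exactly the bookkeeping you'd rather have the engine do for you, which is the point of the comment above.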
ClickHouse has explicitly ordered tables:
https://clickhouse.com/docs/engines/table-engines/mergetree-...
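For concreteness, with a MergeTree engine the sort order is part of the table definition, so point lookups and range scans on the leading key columns don't need a full scan. Table and column names below are invented for the sketch:

```sql
CREATE TABLE readings (
    sensor_id UInt32,
    ts        DateTime,
    value     Float64
) ENGINE = MergeTree
ORDER BY (sensor_id, ts);
```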