Hacker News

mcv today at 9:08 AM | 4 replies

Is everything becoming columnar? Parquet stores data per column instead of per row because it improves compression. I get that. Arrow apparently is columnar, and now DuckDB also gets its efficiency by treating data as columns instead of rows?

I still need to wrap my head around how that works, but it's a fascinating development.
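To illustrate the compression point, here is a toy sketch in Python. It assumes run-length encoding as the compression scheme; Parquet supports run-length and dictionary encodings, but this is not its actual implementation, and the sample data and the `run_length_encode` helper are made up for illustration.

```python
from itertools import groupby

def run_length_encode(values):
    # Collapse each run of equal neighbours into a (value, count) pair.
    return [(v, len(list(g))) for v, g in groupby(values)]

# Hypothetical table with two low-cardinality fields: (country, status).
rows = [
    ("DE", "paid"), ("DE", "paid"), ("DE", "paid"),
    ("US", "paid"), ("US", "open"), ("US", "open"),
]

# Row-oriented: fields of different kinds are interleaved, so runs are short.
row_stream = [field for row in rows for field in row]

# Column-oriented: values of the same kind sit next to each other.
country_col = [r[0] for r in rows]
status_col = [r[1] for r in rows]

row_runs = run_length_encode(row_stream)
col_runs = run_length_encode(country_col) + run_length_encode(status_col)
print(len(row_runs), len(col_runs))
```

With the rows interleaved, no two neighbouring values are equal, so nothing collapses. Stored per column, the same data shrinks to a handful of runs.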


Replies

levanten today at 9:27 AM

It depends on your task. In analytics, where you need to scan lots of data points within a few columns, columnar storage is very much the best. But for transactional workloads, where you have to deal with specific entities, row-based would be more advantageous. There are hybrid systems that try to be both at the same time, but in my experience they end up not doing either very well.

charlieflowers today at 5:52 PM

BTW, columnar is very similar to struct of arrays (SOA) and some of the reasons it works well overlap with SOA.
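For anyone who hasn't met the term, a minimal Python sketch of array of structs versus struct of arrays might look like this. The field names and helper functions are hypothetical, and real engines use typed, contiguous buffers rather than Python dicts.

```python
from array import array

# Array of structs (row-oriented): one record per entity.
rows = [
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 20.0, "qty": 1},
    {"id": 3, "price": 5.0, "qty": 4},
]

# Struct of arrays (column-oriented): one contiguous typed array per field.
columns = {
    "id": array("q", [r["id"] for r in rows]),
    "price": array("d", [r["price"] for r in rows]),
    "qty": array("q", [r["qty"] for r in rows]),
}

def total_price_rows(rows):
    # Visits every record even though only one field is needed.
    return sum(r["price"] for r in rows)

def total_price_columns(columns):
    # Reads a single contiguous array; the other fields are never touched.
    return sum(columns["price"])

def get_row(columns, i):
    # Reassembling one entity needs a lookup in every column.
    return {name: col[i] for name, col in columns.items()}
```

The aggregate is the case where SoA shines; `get_row` shows the cost on the other side, since fetching one whole entity means touching every column.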

skeeter2020 today at 5:00 PM

Compression is a side effect, not really the goal. To simplify: analytical queries often filter on a specific column's values, and if those are laid out contiguously, disk-level reads are much faster than with rows, which involve read-skip-read and so on. In transactional systems data is typically written as rows, though, so writes are likely slower in a columnar system. As a general rule, read-heavy workloads with known access patterns are going to benefit from a columnar layout.
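A rough Python sketch of the read-skip-read pattern, using in-memory byte buffers as a stand-in for disk. The record layout (id, country code, amount) and the function names are assumptions for illustration only.

```python
import struct

# Hypothetical records: (id int64, country int32, amount float64).
records = [(1, 44, 9.5), (2, 1, 3.0), (3, 44, 7.25), (4, 33, 1.0)]

# Row layout: each record is packed back to back, 20 bytes apiece.
ROW = struct.Struct("<qid")
row_buf = b"".join(ROW.pack(*r) for r in records)

# Column layout: the country values alone form one contiguous buffer.
# (The id and amount columns would be stored separately in the same way.)
country_col = struct.pack(f"<{len(records)}i", *(r[1] for r in records))

def filter_rows(buf, country):
    # Read-skip-read: the country sits 8 bytes into every 20-byte record,
    # so the scan has to stride across data it does not need.
    matches = []
    for offset in range(0, len(buf), ROW.size):
        (c,) = struct.unpack_from("<i", buf, offset + 8)
        if c == country:
            matches.append(offset // ROW.size)
    return matches

def filter_column(col, country):
    # One sequential read of just the bytes the filter cares about.
    values = struct.unpack(f"<{len(col) // 4}i", col)
    return [i for i, c in enumerate(values) if c == country]

print(len(row_buf), len(country_col))
```

Both functions return the same matching row indices, but the row scan spans the whole 80-byte buffer while the column scan covers 16 contiguous bytes.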

squirrellous today at 2:13 PM

Those three things you mentioned kind of live in the same niche: offline data storage and querying. In that world, yes, everything has become columnar since it's just better. Row-oriented is still the solution for online streaming use cases.