actionfromafar • yesterday at 3:26 PM • 4 replies

I had to look up what Arrow actually does, and I might have to run some performance comparisons vs sqlite.

It's very neat for some types of data to have columns contiguous in memory.

Replies

skeeter2020 • yesterday at 3:44 PM

>> some performance comparisons vs sqlite.

That's not really the purpose; it's a language-independent format, so you don't need to convert the data for, say, a dataframe library or R. It's columnar because for analytics (where you do lots of aggregations and filtering) this is way more performant; the data is intentionally stored so that each target column is contiguous. You probably already know, but the analytics equivalent of SQLite is DuckDB. Arrow can also eliminate the need to serialize/deserialize data when sharing (e.g. in a high-performance data pipeline) because different consumers / tools / operations can use the same memory representation as-is.
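
To make the "contiguous columns" point concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's API: the `array` module stands in for a typed, contiguous column buffer, and the table, column names, and values are made up for illustration.

```python
from array import array

# Row-oriented layout: one tuple per record, so the values of any
# single column are scattered across many separate objects.
rows = [(1, 9.5, "a"), (2, 3.0, "b"), (3, 7.5, "a")]
total_from_rows = sum(r[1] for r in rows)

# Column-oriented layout: each column lives in its own buffer.
# array("d", ...) packs the doubles back to back in memory.
columns = {
    "id": array("q", [1, 2, 3]),
    "price": array("d", [9.5, 3.0, 7.5]),
    "tag": ["a", "b", "a"],
}

# An aggregation reads exactly one contiguous buffer.
total_from_columns = sum(columns["price"])

# A filter plus aggregation touches only the two columns it needs;
# the "id" column is never read.
total_tag_a = sum(
    p for p, t in zip(columns["price"], columns["tag"]) if t == "a"
)

# Size in bytes of the packed price column (3 doubles).
price_bytes = columns["price"].itemsize * len(columns["price"])
```

Both layouts give the same answer; the difference is how much unrelated data a scan has to step over to get it, which is the part that matters for analytics workloads.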

nu11ptr • yesterday at 3:31 PM

If I recall, Arrow is more or less a standardized in-memory representation of columnar data. I believe it tends not to be used directly, but rather as the foundation for higher-level libraries (like Polars, etc.). That said, I'm not an expert here, so I might not have the full picture.

tosh • yesterday at 4:31 PM

Take a look at parquet.

You can also store Arrow on disk, but it is mainly used as an in-memory representation.

data_ders • yesterday at 3:35 PM

yeah not necessarily compute (though it has a kernel)!

it's actually many things: an IPC protocol, a wire protocol, a database connectivity spec, etc.

in reality it's about an in-memory tabular (columnar) representation that enables zero-copy operations between languages and engines.

and, imho, it all really comes down to standard data types for columns!
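
A rough illustration of what "zero copy" means, again with plain standard-library Python rather than Arrow itself. It assumes a single process and uses `memoryview` as a stand-in for two consumers that agree on the column's type and layout; the variable names are hypothetical.

```python
from array import array

# One typed, contiguous column of doubles.
prices = array("d", [9.5, 3.0, 7.5, 1.0])

# A serialized snapshot is a real copy of the bytes.
snapshot = prices.tobytes()

# Two "consumers" get views over the same buffer; nothing is copied.
view_all = memoryview(prices)
view_mid = memoryview(prices)[1:3]  # slicing a view yields another view

# A write through one view is visible through the original column
# and through every other view, because they share the memory.
view_mid[0] = 4.0
shared_value = prices[1]
seen_by_other_view = view_all[1]

# The snapshot no longer matches, because it was a separate copy.
snapshot_is_stale = snapshot != prices.tobytes()

# Both sides interpret the bytes identically via the shared type code.
element_format = view_all.format
```

The shared type code (`"d"` here) is what lets both sides read the same bytes without a conversion step, which lines up with the point about standard column data types.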
