Hardwood: A New Parser for Apache Parquet

98 points • by rmoff • last Thursday at 3:31 PM • 11 comments

Comments

Cool! I definitely felt the pain of the current options when I added Parquet support to Planetiler to process Overture data. I ended up using parquet-floor to trim the dependencies, but it’s a bit of a hacky approach. If there’s a way to use the lower-level utilities from my own threads without Hardwood spawning its own, then I’ll have to give it a shot.

derriz • today at 1:01 PM

This sounds great. parquet-java is extremely unpleasant to use: it has a massive fan-out of dependencies and an awkward API that exposes them, so the dependencies bleed into a user's code base. The Hadoop stuff is particularly annoying given the relatively poor quality (IMO) of the Hadoop code base and the amount of class name sharing with built-in Java types (like File, FileSystem, etc.). And the performance of parquet-java is very poor compared to the libraries available for other languages.

willtemperley • today at 9:56 AM

Respect for doing this. I recently implemented a Parquet reader in Swift using parquet-java as a reference, and it was by a long way the hardest bit of coding I’ve done. Your bit unpacking is interesting. Is it faster than the 74 KLOC parquet-java bit unpacker?

jrjeksjd8d • today at 5:27 PM

Love to see Gunnar continuing to produce great stuff!

coredog64 • today at 7:08 PM

Excited to see this. Have some upcoming work projects that involve Parquet and Java. Fingers crossed I can get approval to use Java 21.

uwemaurer • today at 7:47 AM

Great! I will give it a try. I found that using DuckDB to select from the Parquet files and using the Apache Arrow API to read the results is also a very fast method.

xnx • today at 1:05 PM

Sounds great. No benchmarks?
