Hacker News

cwillu, yesterday at 8:00 PM

It's a simple format, well suited to accessing resources by byte range, that everyone has tooling for and that _doesn't_ compress things.
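
A minimal sketch of byte-range access into an uncompressed archive, assuming the format in question is tar (the comment does not name it). It uses Python's standard `tarfile` module; `build_index` and the sample file names are hypothetical, made up for illustration.

```python
import io
import tarfile


def build_index(archive: bytes) -> dict[str, tuple[int, int]]:
    """Map each regular file's name to (data offset, size) inside the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}


# Build a tiny uncompressed archive in memory for the demonstration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("a.txt", b"hello"), ("b.txt", b"byte ranges")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()

index = build_index(archive)
offset, size = index["b.txt"]
# The slice stands in for an HTTP Range request or a seek-and-read on disk.
member_bytes = archive[offset:offset + size]
print(member_bytes)
```

Because nothing is compressed, the index only has to be built once; after that, any member can be fetched with a single ranged read and no decoding step.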


Replies

rpdillon, yesterday at 9:03 PM

It would be interesting to extend it to zip, which is what redbean/greenbean use to serve static assets.

Back in school, I worked with a partner on a project called Velox. The idea was to take a bz2-compressed dump of the giant XML export of Wikipedia and write a program to serve that copy of Wikipedia from disk (this was around 2008-2010, in my master's program, so before Kiwix and the amazing zim dumps they produce). My partner worked on the UI and indexing, and I focused on how to parse the bz2 compression format to locate article boundaries in the (giant) XML dump that Wikipedia provides. I ended up putting a lot of time into it because it was a bunch of fun.

Writing this just sent me back to the presentation I made. The slide I wrote back then said:

> Significant original work went into creation of archive access. The Apache BZip2 library that is part of Ant was used as a basis for archive access.

> Modified to support random access to a given byte/bit offset pair within the compressed data stream (BZip2 is not a byte-aligned format)

> Extended to index all BZip2 block positions, allowing Java-based pseudo-random access to BZip2 compressed data

> Extended to map article IDs to block numbers for constant-time article retrieval, even in BZip2 archives exceeding 5GB in size

> Current article retrieval times are ~2 seconds.
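
The block-indexing step from the slide can be sketched roughly as below. This is not the Velox code (which was Java, based on the Ant BZip2 classes); it is a small Python illustration, and `find_block_bit_offsets` is a hypothetical name. It relies on the fact that every bzip2 block starts with the 48-bit marker `0x314159265359`, which can sit at any bit position because the format is not byte-aligned. The same bit pattern could in principle appear by chance inside compressed data, so a real index would presumably confirm each candidate by decoding from it.

```python
import bz2
import os

BLOCK_MAGIC = 0x314159265359  # 48-bit marker that starts each bzip2 block
MASK48 = (1 << 48) - 1


def find_block_bit_offsets(data: bytes) -> list[int]:
    """Return the bit offset of every 48-bit block marker found in the stream."""
    offsets = []
    window = 0  # sliding window holding the most recent 48 bits
    for byte_index, byte in enumerate(data):
        for bit in range(7, -1, -1):  # bzip2 packs bits most significant first
            window = ((window << 1) | ((byte >> bit) & 1)) & MASK48
            bits_read = byte_index * 8 + (7 - bit) + 1
            if bits_read >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_read - 48)
    return offsets


# Incompressible input larger than one 100k block, so at least two blocks exist.
sample = bz2.compress(os.urandom(120_000), compresslevel=1)
offsets = find_block_bit_offsets(sample)
# Each entry is a bit offset; divmod(offset, 8) gives the byte/bit pair.
print([divmod(o, 8) for o in offsets])
```

The first block always follows the 4-byte `BZh1`..`BZh9` stream header, so its offset is bit 32; later blocks generally land on non-zero bit positions within a byte, which is why the index has to store byte/bit pairs rather than plain byte offsets.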

This is back when the archive was ~7GB IIRC. My Kiwix dumps today are ~120GB, but that includes images.

This is the link to the presentation in Google Slides that we wrote back in 2008 or so. The version history shows 2013, but I think some kind of import/conversion happened around that time.

https://docs.google.com/presentation/d/e/2PACX-1vTfrxEqvHbd0...

hackrmn, today at 10:03 AM

As opposed to, I don't know, a _file system_?