It would be interesting to extend it to zip, which is what redbean/greenbean use to serve static assets.
Back in school, I worked on a project called Velox with a partner. The idea was to take the bz2-compressed dump of Wikipedia's giant XML export and write a program to serve that copy of Wikipedia from disk (this was in 2008-2010? during my master's program, so before Kiwix and the amazing ZIM dumps they produce). My partner worked on the UI and indexing, and I focused on parsing the bz2 compression format to locate article boundaries in the (giant) XML dump that Wikipedia provides. I ended up putting a lot of time into it because it was a bunch of fun.
Writing this just sent me back to the presentation I made. The slide I wrote back then said:
> Significant original work went into creation of archive access. The Apache BZip2 library that is part of Ant was used as a basis for archive access.
> Modified to support random access to a given byte/bit offset pair within the compressed data stream (BZip2 is not a byte-aligned format)
> Extended to index all BZip2 block positions, allowing Java-based pseudo-random access to BZip2 compressed data
> Extended to map article IDs to block numbers for constant-time article retrieval, even in BZip2 archives exceeding 5GB in size
> Current article retrieval times are ~2 seconds.
This was back when the archive was ~7GB, IIRC. My Kiwix dumps today are ~120GB, but that includes images.
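For anyone curious what the block indexing on that slide looks like in practice, here is a minimal sketch in Python (not the original Java, and not the Velox code). It assumes a single-stream .bz2 file and relies on two facts about the format: every block starts with the 48-bit marker 0x314159265359 at an arbitrary bit offset, and a lone block can be re-wrapped as a valid stream by reusing its own CRC as the stream CRC, which is the same trick bzip2recover uses. The function names are made up for illustration.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit marker at the start of every bzip2 block
EOS_MAGIC = 0x177245385090    # 48-bit marker that ends the stream
MASK48 = (1 << 48) - 1


def scan_markers(data):
    """Return (bit offsets of all block starts, bit offset of the end marker).

    Blocks are not byte-aligned, so the scan slides a 48-bit window one bit
    at a time. A chance match inside compressed data is possible in theory
    (about 2**-48 per position); this sketch ignores that.
    """
    blocks, eos = [], -1
    window = 0
    pos = 0  # number of bits consumed so far
    for byte in data:
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
            pos += 1
            if pos < 48:
                continue
            if window == BLOCK_MAGIC:
                blocks.append(pos - 48)
            elif window == EOS_MAGIC:
                eos = pos - 48
    if eos < 0:
        raise ValueError("no end-of-stream marker found")
    return blocks, eos


def read_bits(data, start, length):
    """Read `length` bits starting at bit offset `start`, as an integer."""
    first = start // 8
    last = (start + length + 7) // 8
    chunk = data[first:last]
    value = int.from_bytes(chunk, "big")
    drop = len(chunk) * 8 - (start - first * 8) - length  # trailing bits to discard
    return (value >> drop) & ((1 << length) - 1)


def decompress_block(data, blocks, eos, n):
    """Decompress only block `n` by wrapping it in a fresh one-block stream."""
    start = blocks[n]
    end = blocks[n + 1] if n + 1 < len(blocks) else eos
    length = end - start
    body = read_bits(data, start, length)
    crc = read_bits(data, start + 48, 32)      # block CRC follows the marker
    stream = int.from_bytes(data[:4], "big")   # reuse the original "BZh<level>" header
    stream = (stream << length) | body
    stream = (stream << 48) | EOS_MAGIC
    stream = (stream << 32) | crc              # one block: stream CRC == block CRC
    total = 32 + length + 48 + 32
    pad = -total % 8                           # pad the tail to a whole byte
    stream <<= pad
    return bz2.decompress(stream.to_bytes((total + pad) // 8, "big"))
```

Run `scan_markers` once over the archive and save the offsets; after that, `decompress_block(data, blocks, eos, n)` only touches the bytes of block `n`. A table from article ID to block number on top of that gives the constant-time lookup the slide mentions.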
This is the link to the presentation in Google Slides that we wrote back in 2008 or so. The version history shows 2013, but I think some kind of import/conversion happened around that time.
https://docs.google.com/presentation/d/e/2PACX-1vTfrxEqvHbd0...