
vlovich123 · yesterday at 2:11 PM

> Each seek point is accompanied by a chunk (32KB) of uncompressed data which is used to initialise the decompression algorithm, allowing us to start reading from any seek point.

> Apparently the internal state is only 32kB, so if you save this at 1MB offsets, you can reduce the amount of data you need to decompress for one file access to a constant.

You may need to revisit the definition of a constant. 1/32 of additional data is small, but it still grows with the amount of data you're processing (we call that O(n), not O(1)). Specifically it's about 3%, and you generally want to target 1% for this kind of thing (one seek point every ~3 MiB).

And the process still has to read through the entire gzip once to build that index.
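The overhead figures above are easy to check back-of-envelope: 32 KiB of window state per seek point, divided by the spacing between seek points. A quick sketch:

```python
# Back-of-envelope check of the index-overhead figures.
DICT = 32 * 1024        # 32 KiB of deflate window state per seek point
SPACING = 1024 * 1024   # one seek point per 1 MiB of compressed data

overhead = DICT / SPACING
print(f"overhead at 1 MiB spacing: {overhead:.1%}")   # ~3%

# spacing needed to hit a 1% overhead target
spacing_mib = DICT / 0.01 / (1024 * 1024)
print(f"spacing for 1% overhead: {spacing_mib:.3f} MiB")  # ~3 MiB
```

which matches the ~3% at 1 MiB spacing and the "one seek point every ~3 MiB" target for 1%.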


Replies

phiresky · yesterday at 2:25 PM

I think you're looking at this from a different perspective than me. At _build time_ you need to process O(n) data, yes, and generate O(n) additional data. But I said "the amount of data you need to decompress is a constant". At _read time_, you need exactly three steps:

1. Load the file index - this one scales with the number of files unless you do something else smart and get it down to O(log(n)). This gives you an offset into the file. *That same offset/32 is an offset into your gzip index.*

2. Take that offset and load 32 kB into memory (constant: it does not change with the number of files, the total size, or anything else apart from the actual file you are looking at)

3. Decompress a 1 MB chunk (or more if necessary)

So yes, it's a constant.
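The read-time path can be sketched with Python's `zlib`. This is a simplified illustration, not the actual implementation: here seek points are byte-aligned by emitting a `Z_SYNC_FLUSH` at each chunk boundary at build time (real tools such as zlib's `zran.c` instead index arbitrary bit offsets inside an existing stream), and the index format (a list of `(compressed_offset, 32 KiB dictionary)` pairs) is hypothetical:

```python
import zlib

CHUNK = 1024 * 1024   # seek-point spacing: 1 MiB of uncompressed data
DICT = 32 * 1024      # deflate's maximum back-reference distance

def build(data):
    """Build time, O(n): compress `data` as raw deflate and record, per
    chunk, the compressed byte offset plus the previous 32 KiB of
    uncompressed data (the decompressor state needed to resume there).
    Z_SYNC_FLUSH byte-aligns each boundary without resetting history."""
    c = zlib.compressobj(wbits=-15)
    out, index = bytearray(), []
    for i in range(0, len(data), CHUNK):
        index.append((len(out), data[max(0, i - DICT):i]))
        out += c.compress(data[i:i + CHUNK])
        out += c.flush(zlib.Z_SYNC_FLUSH)
    out += c.flush()
    return bytes(out), index

def read_chunk(compressed, index, chunk_no):
    """Read time, O(1): seek to the recorded offset, prime a raw-deflate
    decompressor with the saved 32 KiB window, decompress one chunk.
    Work done here is independent of the total archive size."""
    offset, zdict = index[chunk_no]
    d = (zlib.decompressobj(wbits=-15, zdict=zdict) if zdict
         else zlib.decompressobj(wbits=-15))
    return d.decompress(compressed[offset:], CHUNK)
```

The key point is in `read_chunk`: no matter how large the archive, it touches one index entry, one 32 KiB dictionary, and one compressed chunk.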
