Hacker News

phiresky · yesterday at 12:53 PM

I'm a bit disappointed that this only solves the "find the index of a file in a tar" problem, but not at all the "partially read a tar.gz file" problem. So you're still reading the whole file into memory; why not just extract the files properly while you're at it? That takes the same amount of time (O(n)) and uses less memory.

The gzip random-access problem is a lot more difficult because gzip has internal state. But solutions exist! Apparently the internal state is only 32 kB, so if you save it at 1 MB offsets, you can reduce the amount of data you need to decompress for a single file access to a constant. https://github.com/mxmlnkn/ratarmount does this, apparently using https://github.com/pauldmccarthy/indexed_gzip internally. zlib even ships an example of this method in its own source tree: https://github.com/gcc-mirror/gcc/blob/master/zlib/examples/...
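A minimal sketch of the checkpointing idea, using Python's `zlib.decompressobj().copy()` to snapshot decompressor state in memory. (Real tools like indexed_gzip or zlib's zran.c serialize the 32 KiB window to an index file instead of keeping live objects; this just illustrates the resume-from-checkpoint mechanism.)

```python
import zlib

CHUNK = 64 * 1024  # how much compressed input we feed per step

def build_index(compressed: bytes, checkpoint_every: int = 1 << 20):
    # One full decompression pass; snapshot the decompressor roughly
    # every `checkpoint_every` bytes of *decompressed* output.
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # +16 = gzip wrapper
    index = [(0, 0, d.copy())]  # (input offset, output offset, snapshot)
    in_pos = out_pos = 0
    while in_pos < len(compressed):
        out_pos += len(d.decompress(compressed[in_pos:in_pos + CHUNK]))
        in_pos += CHUNK  # decompressobj buffers partial blocks internally
        if out_pos >= len(index) * checkpoint_every:
            index.append((in_pos, out_pos, d.copy()))
    return index

def read_at(compressed: bytes, index, offset: int, length: int) -> bytes:
    # Resume from the last checkpoint at or before `offset`, decompress
    # forward, and discard bytes until the target range is reached.
    in_pos, out_pos, snap = max((e for e in index if e[1] <= offset),
                                key=lambda e: e[1])
    d = snap.copy()
    buf = bytearray()
    while len(buf) < (offset - out_pos) + length and in_pos < len(compressed):
        buf += d.decompress(compressed[in_pos:in_pos + CHUNK])
        in_pos += CHUNK
    skip = offset - out_pos
    return bytes(buf[skip:skip + length])
```

With checkpoints every 1 MiB, any single read decompresses at most ~1 MiB plus the requested range, instead of everything before the target offset.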

It all depends on the use case, of course. The author seems to have a pretty specific one here, though I still don't see the advantage of this vs. extracting in JS and adding each file individually to memfs. "Without any copying" doesn't really make sense, because the only difference is copying ONE 1 MB tar blob into a Uint8Array vs. 1000 1 kB file blobs.

One very valid constraint the author imposes is not being able to touch the source file. If you can touch it, there are of course a thousand better solutions to all of this, like using zip, which compresses each file individually and always has a central index at the end.
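To make the zip point concrete, here's a small sketch with Python's zipfile (the in-memory buffer and file names are just for illustration): because the central directory sits at the end and each member is compressed independently, reading one member never requires decompressing its neighbors.

```python
import io
import zipfile

# Build an archive with many individually-compressed members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    for i in range(1000):
        z.writestr(f"file{i:04}.txt", f"contents of file {i}")

# Reading one member touches only the central directory (at the end of
# the archive) plus that member's local header and compressed bytes.
with zipfile.ZipFile(buf) as z:
    data = z.read("file0500.txt")  # no full-archive decompression
```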


Replies

ImJasonH · yesterday at 2:01 PM

The first time I'd heard of this was via https://github.com/jonjohnsonjr/dagdotdev/blob/main/internal... which powers https://oci.dag.dev to let you browse OCI images (e.g., https://oci.dag.dev/fs/ubuntu@sha256:b40150c1c2717d324cdb172...)

mxmlnkn · yesterday at 4:07 PM

> Apparently the internal state is only 32kB

Exactly. And often this state is either highly compressible or non-compressible but only sparsely used. The latter can then be made compressible by replacing the unused bytes with zeros.

Ratarmount uses indexed_gzip, and when parallelization makes sense, it also uses rapidgzip. Rapidgzip implements the sparsity analysis to increase compressibility and then simply uses the gztool index format, i.e., compresses each 32 KiB using gzip itself, with unused bytes replaced with zeros where possible.

indexed_gzip, gztool, and rapidgzip all support seeking in gzip streams, but all have trade-offs, e.g., rapidgzip is parallelized but, because of that, has much higher memory usage than indexed_gzip or gztool. It might be possible to compile any of these to WebAssembly if there is demand.

nrclark · today at 6:33 AM

Tar doesn't need to imply gzip (or bzip2, or zstd, etc.). Tar's default operation produces uncompressed archives.
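And an uncompressed tar is trivially seekable: members are laid out back-to-back behind 512-byte headers, so you can index it in one header-walking pass and then read any member directly. A sketch using Python's tarfile (which exposes each member's data offset), not the tool discussed in the article:

```python
import io
import tarfile

def build_tar_index(fileobj):
    # Walk the 512-byte headers of an *uncompressed* tar. tarfile seeks
    # past each member's data, so only the headers are actually read.
    index = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as tf:  # "r:" = no compression
        for member in tf:
            index[member.name] = (member.offset_data, member.size)
    return index

def read_member(fileobj, index, name):
    # Random access: seek straight to the member's data and read it.
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)
```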

vlovich123 · yesterday at 2:11 PM

> Each seek point is accompanied by a chunk (32KB) of uncompressed data which is used to initialise the decompression algorithm, allowing us to start reading from any seek point.

> Apparently the internal state is only 32kB, so if you save this at 1MB offsets, you can reduce the amount of data you need to decompress for one file access to a constant.

You may need to revisit the definition of a constant. 1/32 of additional data is small, but it still grows the more data you're processing (we call that O(n), not O(1)). Specifically it's ~3%, and you generally want to target ~1% for this kind of overhead (one seek point every 3 MiB).

And the process still has to read through the entire gzip once to build that index.
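The overhead arithmetic, spelled out (assuming one 32 KiB saved window per seek point, as the parent comment describes):

```python
window = 32 * 1024            # saved decompressor state per seek point
spacing = 1024 * 1024         # one seek point per 1 MiB of data
overhead = window / spacing   # 0.03125, i.e. ~3% extra storage, which
                              # grows linearly with input size: O(n), not O(1)

target = 0.01                   # aiming for ~1% overhead instead
spacing_1pct = window / target  # 3,276,800 bytes ~= 3.1 MiB between seek points
```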

a_t48 · yesterday at 7:07 PM

If anyone knows of a similar solution for zstd, I'm very interested. I'm doing streaming decompression to disk, and I'd like to be able to do resumable downloads without _also_ storing the compressed file.
