
An ode to bzip

167 points | by signa11 | yesterday at 4:01 PM | 83 comments

Comments

idoubtit | yesterday at 8:26 PM

My experience does not match theirs when compressing text and code:

> bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.

I've just checked again with a 1GB SQL file. `bzip2 -9` shrinks it to 83MB. `zstd -19 --long` to 52MB.

Others have compressed the Linux kernel and found that bzip2's output is about 15% larger than zstd's.
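The gap is easy to sanity-check with Python's stdlib alone (here `lzma` stands in for the stronger-ratio family, since zstd needs a third-party binding; the data is made-up repetitive SQL-like text, not the commenter's actual dump):

```python
import bz2
import lzma

# Repetitive SQL-like text as a stand-in for a database dump.
data = b"INSERT INTO users (id, name) VALUES (42, 'user');\n" * 20_000

bz = bz2.compress(data, compresslevel=9)   # bzip2 -9 equivalent
xz = lzma.compress(data, preset=9)         # xz -9 equivalent
print(len(data), len(bz), len(xz))
```

Ratios on real mixed text will differ, but on repetitive structured text bzip2 is rarely the smallest.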

saghm | yesterday at 5:47 PM

Early on, the article mentions that xz and zstd have gotten more popular than bzip, and my admittedly naive understanding is that they're considered to have better tradeoffs in terms of compression time and overall space saved. The performance section heavily discusses the encoding performance of gzip and bzip, but unless I'm missing something, the only references to xz or zstd in that section briefly handwave about the decoding times probably being similar.

My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of bzip's diminished popularity in favor of the non-gzip alternatives that it admits have been the more popular choices in recent years.

thesz | yesterday at 9:56 PM

BWT is a prediction by partial match (PPM) in disguise.

Consider "bananarama":

  "abananaram"
  "amabananar"
  "ananaramab"
  "anaramaban"
  "aramabanan"
  "bananarama"
  "mabananara"
  "nanaramaba"
  "naramabana"
  "ramabanana"
The last symbol on each line gets context from the first symbols of the same line. This is a consequence of the rotation.

But, due to sorting, contexts are not contiguous for the predicted (last) character, and long dependencies are broken. That breakage is why MTF, which implicitly transforms direct symbol statistics into something like Zipfian [1] statistics, encodes BWT's output well.

[1] https://en.wikipedia.org/wiki/Zipf%27s_law

Given that, the author may find PPM*-based compressors more performant compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: some "durilka-bububu" compressor that uses PPM fares better than BWT, by almost a third.
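The transform above is easy to sketch with a naive rotation sort (fine for a toy string, hopeless for real inputs, where suffix-array construction is used instead):

```python
def bwt(s: str) -> str:
    # Sort all rotations, take the last column (as in the table above).
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mtf(s: str) -> list[int]:
    # Move-to-front: emit each symbol's index, then move it to the front.
    table = sorted(set(s))
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

print(bwt("bananarama"))       # "mrbnnaaaaa" -- runs of repeated symbols
print(mtf(bwt("bananarama")))  # mostly small indices, ends in zeros
```

The trailing run of zeros in the MTF output is exactly the skewed-toward-small distribution that a final entropy coder exploits.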

fl0ki | yesterday at 6:04 PM

This seems as good a thread as any to mention that the gzhttp package in klauspost/compress for Go now supports zstd on both server handlers and client transports. Strangely, this was added in a patch version instead of a minor version, despite both expanding the API surface and changing default behavior.

https://github.com/klauspost/compress/releases/tag/v1.18.4

hexxagone | yesterday at 6:00 PM

Notice that bzip3 has close to nothing to do with bzip2. It is a different BWT implementation with a different entropy codec, from a different author (as noted in the GitHub description "better and stronger spiritual successor to BZip2").

eichin | yesterday at 10:50 PM

Interesting detail on the algorithm, but it seems to completely miss that, if you care about non-streaming performance, there are parallel versions of xz and gzip (pxzip encodes compatible metadata about the breakup points, so xz can still decompress the output while pxzip uses as many cores as you let it have). Great for disk-image OS installers, which is why I was benchmarking it in the first place; this was about 5 years back, and I don't know if those have been upstreamed.
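The block-parallel trick those tools use can be sketched with the stdlib alone (a toy, not pbzip2: bzip2 is block-structured and concatenated .bz2 streams decompress as one, and `bz2.compress` releases the GIL, so plain threads get real parallelism here):

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def parallel_bz2(data: bytes, chunk_size: int = 1 << 20) -> bytes:
    # Compress fixed-size chunks independently on a thread pool;
    # concatenating the resulting .bz2 streams is still valid input
    # for a multi-stream decoder like bz2.decompress.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(lambda c: bz2.compress(c, 9), chunks))

data = b"example payload\n" * 500_000
packed = parallel_bz2(data)
assert bz2.decompress(packed) == data  # round-trips through plain bz2
```

The cost is a slightly worse ratio, since each chunk loses cross-chunk context; the parallel tools accept the same trade.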

joecool1029 | yesterday at 5:43 PM

Just use zstd unless you absolutely need to save a tiny bit more space. bzip2 and xz are extremely slow to compress.

pella | yesterday at 6:21 PM

imho: the future is a specialized compressor optimized for your specific format. ( https://openzl.org/ , ... )

cobbzilla | yesterday at 9:26 PM

My first exposure to bzip: The first Linux kernels I ever compiled & built myself (iirc ~v2.0.x), I packed as .tar.bz2 images. Ah the memories.

Yes, there are better compression options today.

vintermann | yesterday at 9:34 PM

If you're implementing it for Computercraft anyway, there's no reason to stick to the standard. It's well known that bzip2 has a couple of extra steps which don't improve compression ratio at all.

I suggest implementing Scott's bijective Burrows-Wheeler variant on bits rather than bytes, and doing bijective run-length encoding of the resulting string. It's not exactly on the Pareto frontier, but it's fun!
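For reference, the plain (non-bijective) run-length step looks like this; the bijective variants are the fun part left to the implementer:

```python
def rle(s: str) -> list[tuple[str, int]]:
    # Collapse each run of a repeated symbol into a (symbol, count) pair.
    runs: list[tuple[str, int]] = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs

print(rle("aaabccccd"))  # [('a', 3), ('b', 1), ('c', 4), ('d', 1)]
```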

Grom_PE | yesterday at 6:53 PM

PPMd (of 7-Zip) would beat BZip2 for compressing plain text data.

jiggawatts | today at 9:55 AM

I recently did some work on genomics / bioinformatics, where terabyte-sized text datasets are common and are often compressed and decompressed multiple times within a workload. This often becomes the serial bottleneck we all know and love from Amdahl's law.

I ran a bunch of benchmarks, and found that the only thing that mattered was if a particular tool or format supported parallel compression and/or parallel decompression. Nothing else was even close as a relevant factor.

If you're developing software for processing even potentially large files and you're using a format that is inherently serial, you've made a mistake. You're wasting 99.5% of a modern server's capacity, and soon that'll be 99.9%.

It really, really doesn't matter if one format is 5% faster or 5% bigger or whatever if you're throwing away a factor of 200 to 1,000 speedup that could be achieved through parallelism! Or conversely, the ability to throw up to 1,000x the compute at improving the compression ratio in the available wall clock time.

Checksumming, compression, decompression, encryption, and decryption over bulk data must all switch to fully parallel codecs, now. Not next year, next decade, or the year after we have 1,000+ core virtual machines in the cloud available on a whim. Oh wait, we do already: https://learn.microsoft.com/en-us/azure/virtual-machines/siz...

Not to mention: 200-400 Gbps NICs are standard now in all HPC VMs and starting to become commonplace in ordinary "database optimised" cloud virtual machines. Similarly, local and remote SSD-backed volumes that can read and write at speeds north of 10 GB/s.

There are very few (if any) compression algorithms that can keep up with a single TCP/IP stream on a data centre NIC, or a single file read from a modern SSD, without parallelism. Similarly, most CPUs struggle to perform even just SHA or AES at those speeds on a single core.
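The checksumming point generalizes the same way as compression: hash fixed-size chunks in parallel, then hash the digests. A stdlib sketch (note this defines a *different* checksum than SHA-256 over the whole file, so both ends must agree on the chunking; `hashlib` releases the GIL on large buffers, which is why threads help):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def tree_sha256(data: bytes, chunk_size: int = 1 << 20) -> str:
    # Hash fixed-size chunks on a thread pool, then hash the
    # concatenated chunk digests into one root digest.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        digests = pool.map(lambda c: hashlib.sha256(c).digest(), chunks)
    return hashlib.sha256(b"".join(digests)).hexdigest()

print(tree_sha256(b"x" * (8 << 20)))
```

This is the same shape as the tree/chunked hashes that object stores already use for multipart uploads.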

elophanto_agent | yesterday at 5:15 PM

bzip2 is the compression algorithm equivalent of that one coworker who does incredible work but nobody ever talks about. meanwhile gzip gets all the credit because it's "good enough"
