I recently did some works on genomics / bioinformatics, where terabyte-sized text datasets are ...

jiggawatts • today at 9:55 AM • 1 reply • view on HN

I recently did some works on genomics / bioinformatics, where terabyte-sized text datasets are common and often compressed and decompressed multiple times in some workloads. This often becomes the serial bottleneck we all know and love from Amdahl's law.

I ran a bunch of benchmarks, and found that the only thing that mattered was if a particular tool or format supported parallel compression and/or parallel decompression. Nothing else was even close as a relevant factor.

If you're developing software for processing even potentially large files and you're using a format that is inherently serial, you've made a mistake. You're wasting 99.5% of a modern server's capacity, and soon that'll be 99.9%.

It really, really doesn't matter if one format is 5% faster or 5% bigger or whatever if you're throwing away a factor of 200 to 1,000 speedup that could be achieved through parallelism! Or conversely, the ability to throw up to 1,000x the compute at improving the compression ratio in the available wall clock time.

Checksumming, compression, decompression, encryption, and decryption over bulk data must all switch to fully parallel codecs, now. Not next year, next decade, or the year after we have 1,000+ core virtual machines in the cloud available on a whim. Oh wait, we do already: https://learn.microsoft.com/en-us/azure/virtual-machines/siz...

Not to mention: 200-400 Gbps NICs are standard now in all HPC VMs and starting to become commonplace in ordinary "database optimised" cloud virtual machines. Similarly, local and remote SSD-backed volumes that can read and write at speeds north of 10 GB/s.

There are very few (any?) compression algorithms that can keep up with a single TCP/IP stream on a data centre NIC or single file read from a modern SSD unless using parallelism. Similarly, most CPUs struggle to perform even just SHA or AES at those speeds on a single core.

Replies

hexxagone • today at 12:53 PM

Then you should take a look at https://github.com/flanglet/kanzi-cpp: it is optimized for fast roundtrips, multi-threaded by design and produces a seekable bitstream.

alt Hacker News

Replies