Then there is this eternal conversation about whether one should encrypt and then compress, or compress and then encrypt.
Encrypted data will not compress well, because encryption needs to remove patterns, and patterns are exactly what compression exploits.
If you compress and then encrypt, yes you can leak information through the file sizes, but there isn't really a way out of this. Encryption and compression are fundamentally at odds with each other.
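To get a quick feel for why the order matters, here is a minimal Python sketch. The toy XOR keystream is only a stand-in for a real cipher, and the sample text is arbitrary; the point is just that compressing before encrypting works, while compressing after encrypting gains nothing.

    import hashlib
    import os
    import zlib

    def toy_stream_cipher(data: bytes, key: bytes) -> bytes:
        # Stand-in for a real cipher: XOR with a SHA-256 keystream in counter
        # mode. Illustrative only; use a vetted AEAD cipher in real code.
        keystream = bytearray()
        counter = 0
        while len(keystream) < len(data):
            keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return bytes(b ^ k for b, k in zip(data, keystream))

    key = os.urandom(32)
    plaintext = b"the quick brown fox jumps over the lazy dog " * 200

    # compress-then-encrypt: the compressor still sees the repetition
    compress_then_encrypt = toy_stream_cipher(zlib.compress(plaintext), key)

    # encrypt-then-compress: the ciphertext looks random, so zlib gains nothing
    encrypt_then_compress = zlib.compress(toy_stream_cipher(plaintext, key))

    print(len(plaintext), len(compress_then_encrypt), len(encrypt_then_compress))
    # typically: the first pipeline shrinks the data dramatically, while the
    # second stays around the plaintext size or slightly larger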
Really interesting.
I was trying to implement a compression algorithm selection heuristic in some file format code I am developing. I found it too hard to reason about, so I basically gave up on it.
Feels like this blog post is getting there, but there could be a more detailed set of equations that actually calculates this from some other parameters.
Keeping the code completely flexible and doing a full production-load test with the desired parameters to find the best tuning is an option, but that is also very difficult.
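A rough sketch of one such heuristic, in Python: try each candidate codec on a small sample of the data and score it by compressed size plus a CPU-time penalty. The codec list, the sample size, and the cpu_cost_bytes_per_sec knob are all assumed tuning parameters; that knob is exactly the sort of thing a proper set of equations would derive.

    import bz2
    import lzma
    import time
    import zlib

    # Candidate codecs from the standard library; a real format would plug in
    # whatever codecs it actually supports (lz4, zstd, ...).
    CODECS = {
        "zlib-1": lambda d: zlib.compress(d, 1),
        "zlib-9": lambda d: zlib.compress(d, 9),
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }

    def pick_codec(data, sample_size=64 * 1024, cpu_cost_bytes_per_sec=50e6):
        # Score each codec on a small sample: compressed size plus a penalty
        # for CPU time, converted into "equivalent bytes" via the (assumed)
        # cpu_cost_bytes_per_sec knob.
        sample = data[:sample_size]
        best_name, best_cost = None, float("inf")
        for name, compress in CODECS.items():
            start = time.perf_counter()
            size = len(compress(sample))
            elapsed = time.perf_counter() - start
            cost = size + elapsed * cpu_cost_bytes_per_sec
            if cost < best_cost:
                best_name, best_cost = name, cost
        return best_name

    data = b"".join(b"record %d: status=OK payload=aaaaaaaa\n" % i for i in range(5000))
    print(pick_codec(data))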
I also read this previously, which I found similar.
Your SSL cert is invalid.
Arithmetic coding of a single bit preserves the ordering of encoded bits if CDF(1) > CDF(0). If a byte is encoded from its higher bits down to its lower bits, arithmetic coding (even with a dynamic model) will preserve the ordering of individual bytes.
In the end, arithmetic coding preserves the ordering of encoded strings. Thus, comparison operations can be performed on the compressed representation of strings (and on big-endian representations of integers, and even floating point values), without needing to decompress the data until the decompressed strings themselves are needed.
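A minimal sketch of that ordering property in Python, using exact rationals instead of the usual bit-level renormalisation. The particular weights and the end-of-string marker are assumptions made for illustration; all that matters is that the cumulative totals grow with the symbol value, i.e. the CDF is monotone.

    from fractions import Fraction

    # Static model over 257 symbols: 0 is an end-of-string marker (the smallest
    # symbol), byte b is encoded as symbol b + 1. Any positive weights work for
    # the ordering argument as long as cumulative totals grow with symbol value.
    WEIGHTS = [1] + [b + 2 for b in range(256)]
    TOTAL = sum(WEIGHTS)
    CUM = [0]
    for w in WEIGHTS:
        CUM.append(CUM[-1] + w)

    def encode(data: bytes) -> Fraction:
        # Return the lower end of the arithmetic-coding interval for `data`,
        # computed exactly with rationals.
        lo, width = Fraction(0), Fraction(1)
        for sym in [b + 1 for b in data] + [0]:   # append the end marker
            lo += width * Fraction(CUM[sym], TOTAL)
            width *= Fraction(WEIGHTS[sym], TOTAL)
        return lo

    tests = [b"", b"hi", b"hi!", b"hia", b"abc", b"abd", b"ab"]
    codes = {s: encode(s) for s in tests}
    # The encoded values sort exactly like the original byte strings.
    assert sorted(tests) == sorted(tests, key=lambda s: codes[s])
    print(sorted(tests, key=lambda s: codes[s]))

The end-of-string marker is the smallest symbol, which plays the same role as the implicit trailing zeroes in the mantissa view below.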
Another view: memcmp compares strings as if they were mantissas in base 256. "hi!" is 'h'*(1/256) + 'i'*(1/256)^2 + '!'*(1/256)^3 + 0*(1/256)^4, and then zeroes out to infinity. Arithmetic coding represents encoded strings as mantissas in base 2; range coding can use other bases, such as 256.
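The mantissa view itself, as a small sketch with exact rationals (the string literals are arbitrary examples):

    from fractions import Fraction

    def mantissa(data: bytes) -> Fraction:
        # Interpret a byte string as a base-256 fraction 0.b0 b1 b2 ...,
        # i.e. b0/256 + b1/256**2 + ..., with implicit zeroes afterwards.
        return sum(Fraction(b, 256 ** (i + 1)) for i, b in enumerate(data))

    tests = [b"hi!", b"hi", b"ha", b"abc", b"abd", b"", b"zz"]
    # Lexicographic (memcmp-style) order of the strings equals numeric order
    # of their base-256 mantissas (modulo trailing zero bytes, which vanish).
    assert sorted(tests) == sorted(tests, key=mantissa)
    print(mantissa(b"hi!"))  # 104/256 + 105/256**2 + 33/256**3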