Thank you Michael Rabin for your excellent work. Rest in Peace.
Rabin Fingerprinting is one of my favorites of his contributions. It's a "rolling hash" that allows you to quickly compute a 32-bit (or larger) hash at *every* byte offset of a file. It is used most notably to do file block matching/deduplication when those matching blocks can be at any offset. It's tragically underappreciated.
I've been meaning to write up a tutorial as part of my Galois Field series. Someday..
Thank you again!
Important to note that FastCDC is about an order of magnitude for block deduplication and is generally considered the state of art for such an approach (speed of computing the hash is more important than absolutely optimal distribution of hashes).
I'm working on a data annotation system based around Rabin fingerprints. They're a really neat idea.
I especially like how if you end up with hash characteristics that you don't like, your can just select a different irreducible Galois polynomial and now you've got a whole new hash algorithm. It's like tuning to a different frequency.
For me it means I don't have to worry about cases where there aren't enough nearby fingerprints for the annotation to adhere to, I can just add or remove polynomials until I get a good density.
I recently found his fingerprint algorithm and wrote a utility that uses it to find duplicate MIPS code for decompilation[0] and build unique identifiers that can be used to find duplicates without sharing any potentially copyrighted data[1].
This replaced some O(n²) searches through ASCII text, reducing search time from dozens of seconds to fractions of a second.
0 - https://github.com/ttkb-oss/mipsmatch 1 - https://github.com/ttkb-oss/mipsmatch/wiki/Identifiers