logoalt Hacker News

Fast and Easy Levenshtein distance using a Trie (2011)

64 pointsby sebglast Sunday at 12:00 PM10 commentsview on HN

Comments

sminchevtoday at 10:17 AM

A few years ago, before the AI boom I needed to create a de-duplication app, as a PoC. To be able to compare fast millions of contact data and to search for the duplicates. The clients' approach was taking, in best case, a day to compare everything and generate a report.

What we do was a combination of big data engine, like Apache Spark, a few comparison algorithms like Levenshtein, and ML. AI was not treated as an option to do such things at all! :)

What we did was to use Apache Spark to apply the static algorithms, if we get confident results like less than 10% equality or more than 90% of equality, we treated those as sure signs for records be duplicated or not. Records that were somewhere in the middle, we sent to Machine Learning libraries for analysis. Of course some education was needed for statistical basis. And hard to be automatically analyzed, we placed in a report for human touch ;)

We got relatively good results. It was a Scala based app, as far as I remember :)

Now with AI, it is much more easy... And boring! :D No complexities, no challenges.

show 1 reply
kristianptoday at 9:57 AM

The author also has an interesting discussion of Succinct Data Structures at https://stevehanov.ca/blog/succinct-data-structures-cramming...

dvhtoday at 6:51 AM

I made myself plugin that shows new news in wikipedia's current event page and I was using levenshtein originally (they often edit couple of words in article over span of few days so I compare each new article with previous ones for similarity) but after few days it became too slow (~20s) because O(m*n), so I switched to sorensen-dice instead which is O(m+n) and it's much faster and works very similar way, even tho they do slightly different thing.

kelseydhtoday at 7:16 AM

I needed a fuzzy string matching algorithm for finding best name matches among a candidate list. Considered Normalized Levenshtein Distance but ended up using Jaro-Winkler. I'm curious if anybody has good resources on when to use each fuzzy string matching algorithm and when.

localhostertoday at 7:53 AM

This article surface every once in a while, and I love it. What the author suggests is very clever. I have implemented an extended version of that in Go as an experiment. Instead of using a trie, I used a radix tree. Functions the same, but it's much more compressed (and faster).

fergietoday at 6:29 AM

Very cool and satisfying.

gregman1today at 7:43 AM

2011

show 1 reply
consomidatoday at 9:31 AM

Using a trie to calculate Levenshtein distance is such a clever optimization. Clear explanation and practical examples make it easy to understand and implement