Hacker News

kelseydh today at 7:16 AM (2 replies)

I needed a fuzzy string matching algorithm for finding the best name matches among a candidate list. I considered normalized Levenshtein distance but ended up using Jaro-Winkler. I'm curious if anybody has good resources on when to use each fuzzy string matching algorithm.


Replies

vintermann today at 10:33 AM

Levenshtein distance is rarely the similarity measure you need. Words usually mean something, and it's usually the distance in meaning you need.

As usual, examples from my genealogy hobby: many sites allow you to upload your family tree as a GEDCOM file and compare it to other people's trees or a public tree. Most of these use Levenshtein distance on names to judge similarity, and it's terrible. Anne Nilsen and Anne Olsen could be the same person, right? No!! These tools are unfortunately useless to me because they give so many false positives.
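
To put a number on the Anne Nilsen / Anne Olsen case, here is a minimal Python sketch of plain Levenshtein distance. The function names are made up for illustration, and dividing by the longer string's length is just one common way to normalize, assumed here; I don't know what the genealogy sites actually compute.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest


distance = levenshtein("Anne Nilsen", "Anne Olsen")
similarity = normalized_similarity("Anne Nilsen", "Anne Olsen")
print(distance, round(similarity, 2))  # prints: 2 0.82
```

Two edits (swap N for O, drop the i) separate two different surnames, so any threshold loose enough to tolerate ordinary spelling variation will also wave this pair through.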

These days, an embedding model is the way to go. Even a small, bad embedding model is better than Levenshtein distance if you care about the meaning of the string.

RobinL today at 11:03 AM

There's a section in the docs of our FOSS record linkage software that covers this: https://moj-analytical-services.github.io/splink/topic_guide...