If you get to the point where Levenshtein ("edit distance") isn't capturing all of the matches you need, I'd strongly encourage you to check out pg_trgm. It. Is. Awesome.
postgresql.org/docs/current/pgtrgm.html.
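To give a feel for it, here's a minimal sketch of getting pg_trgm going; the example strings are just the ones used in the table further down.

```sql
-- Enable the extension (it ships in Postgres's contrib package).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- similarity() returns 0-1, where 1 means the two strings share exactly
-- the same set of trigrams. Reversed first/last names share nearly all
-- of their trigrams, so the score stays high.
SELECT similarity('David Adams', 'Adams David');
SELECT similarity('David Adams', 'Adam David');
```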
As an example of why to use trigrams, they let you pick up cases where `first_name` and `last_name` are reversed, a relatively common error. Levenshtein isn't well suited to spotting that: all it does is count the single-character edits needed to transform one string into another, so swapped elements push the distance up quite a bit and make the match look worse than it really is. As an example, pretend you have a record where the correct full name is "David Adams". It's pretty common to find the last name recorded as "Adam", and to find first and last names reversed. So that's three plausible forms for a simple name. How does Levenshtein perform compared with the Postgres trigram implementation? For this, I compared `levenshtein(string 1, string 2)` with `similarity(string 1, string 2)`. As noted above, Levenshtein is a count where a higher score means less similar. To normalize the scores to a 0-1 value where 1 = identical, I divided the distance by the length of the longer full name, as suggested above, and subtracted the result from 1. That last step makes the figures directly comparable to a `similarity()` score. (Otherwise, you've got two scales where 1 means opposite things.)
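In SQL terms, the normalization described above looks roughly like this (`levenshtein()` lives in the fuzzystrmatch extension; the cast avoids integer division):

```sql
-- levenshtein() comes from the fuzzystrmatch extension.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT
    levenshtein('David Adams', 'Adam David')            AS lev_distance,
    round(1 - levenshtein('David Adams', 'Adam David')::numeric
              / greatest(length('David Adams'), length('Adam David')), 2)
                                                        AS lev_score,   -- 0-1, 1 = identical
    similarity('David Adams', 'Adam David')             AS trgm_score;  -- 0-1, 1 = identical
```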
Here are some simple results, rounded a bit for clarity:
| Row 1 | Row 2 | levenshtein() | Levenshtein % | Similarity % |
|---|---|---|---|---|
| David Adams | Adam David | 10 | 9 | 77 |
| Adam David | Adams David | 1 | 91 | 77 |
| Adams David | David Adams | 10 | 9 | 100 |
As you can see, the `similarity()` score performs better in a lot of cases, even with this simple example. Then again, Levenshtein does better in one case. It's not rare to combine techniques; if you do that, normalize the scales to the same range to save yourself some headache.
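If you do blend them, one possibility (just one way to do it, not a recipe) is to take the better of the two normalized scores, so a record survives either a swapped name or a small typo:

```sql
-- "Take the greatest of the two scores" is just one possible blend,
-- chosen here for illustration.
SELECT greatest(
           similarity('David Adams', 'Adam David'),
           1 - levenshtein('David Adams', 'Adam David')::numeric
               / greatest(length('David Adams'), length('Adam David'))
       ) AS combined_score;
```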
But all of this is made a lot easier if you've got cleaner data to start with. If one of your problems is with inconsistent abbreviations and punctuation, Levenshtein can be a poor match. For this reason, it's helpful to perform address standardization before duplicate matching.
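As a toy illustration of the kind of pre-cleaning I mean (a couple of regexp_replace rules stand in for a real address standardizer here):

```sql
-- Toy normalization: lower-case, trim, and expand a couple of common
-- abbreviations. A real address standardizer handles far more than this.
SELECT regexp_replace(
           regexp_replace(lower(trim('  123 N Main St  ')), '\mst\M', 'street'),
           '\mn\M', 'north'
       ) AS cleaned_address;   -- => '123 north main street'
```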
For what it's worth (a lot), trigrams in Postgres can use indexes. It's often a good bet to find a technique that safely reduces the candidate set with an indexed search before performing a more expensive comparison with something like Levenshtein. Ah, and a trick for Levenshtein: if you have a target tolerance and store the length of your strings, you can exclude strings that are too short or too long based on that stored length, without ever running the more expensive fuzzy comparison. If, for example, your starting string is 10 characters long and you only want matches at most 2 edits away, you're wasting your time testing strings that are only 7 characters long (or 13 and longer), since the length difference alone already exceeds the tolerance.
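Putting those two ideas together, a sketch might look like this (table, column, and threshold choices are all assumptions; the generated column needs Postgres 12+, and the trigram pre-filter is only "safe" if its threshold is loose enough for your edit-distance tolerance):

```sql
-- Hypothetical table with the string length stored alongside the name.
CREATE TABLE people (
    id        bigint PRIMARY KEY,
    full_name text NOT NULL,
    name_len  int GENERATED ALWAYS AS (length(full_name)) STORED
);

-- Trigram index so the % operator can do the cheap first pass.
CREATE INDEX people_full_name_trgm
    ON people USING gin (full_name gin_trgm_ops);

-- Stage 1: indexed trigram pre-filter (% uses pg_trgm.similarity_threshold,
--          0.3 by default).
-- Stage 2: cheap length bound. With a tolerance of 2 edits, candidates
--          must be within 2 characters of the target's length.
-- Stage 3: the expensive levenshtein() runs only on what's left.
SELECT id, full_name
FROM people
WHERE full_name % 'David Adams'
  AND name_len BETWEEN length('David Adams') - 2 AND length('David Adams') + 2
  AND levenshtein(full_name, 'David Adams') <= 2;
```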
Note that the bad data input problem you describe often comes down to
- poor user training and/or
- poor UX
It's worth reviewing how bad data is getting in, once you've got your cleanup in good order. If you have a finite set of trainable users, it can help to run a nightly (etc.) scan to detect new likely duplicates, and then go talk to whoever is generating them. Maybe there's something they don't know that you can tell them; maybe there's a problem in the UI that you don't know about and they can tell you.
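For that nightly scan, the rough shape might be something like this (every table and column name here is an assumption; `created_at` in particular is just standing in for however you track new records):

```sql
-- Pair up records added in the last day with any existing record whose
-- name looks suspiciously similar, worst offenders first.
SELECT b.id                                 AS new_id,
       a.id                                 AS existing_id,
       b.full_name                          AS new_name,
       a.full_name                          AS existing_name,
       similarity(a.full_name, b.full_name) AS score
FROM people b
JOIN people a
  ON a.id <> b.id
 AND a.full_name % b.full_name              -- indexed trigram match
WHERE b.created_at >= now() - interval '1 day'
ORDER BY score DESC;
```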