If you get to the point where Levenshtein ("edit distance") isn't capturing all of the matches you need, I'd strongly encourage you to check out pg_trgm. It. Is. Awesome.
postgresql.org/docs/current/pgtrgm.html.
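To give a feel for it, here's a minimal sketch of getting pg_trgm going; the example strings are just the ones used in the table further down.

```sql
-- Enable the extension (it ships in Postgres's contrib package).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- similarity() returns 0-1, where 1 means the two strings share exactly
-- the same set of trigrams. Reversed first/last names share nearly all
-- of their trigrams, so the score stays high.
SELECT similarity('David Adams', 'Adams David');
SELECT similarity('David Adams', 'Adam David');
```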
As an example of why to use trigrams, they let you pick up cases where `first_name` and `last_name` are reversed, a relatively common error. Levenshtein isn't well suited to spotting that: all it does is count the single-character edits needed to transform one string into another, so swapped elements push the distance up quite a bit and make the match look worse than it really is. As an example, pretend you have a record where the correct full name is "David Adams". It's pretty common to find the last name recorded as "Adam", and to find first and last names reversed. So that's three plausible forms for a simple name. How does Levenshtein perform compared with the Postgres trigram implementation? For this, I compared `levenshtein(string 1, string 2)` with `similarity(string 1, string 2)`. As noted above, Levenshtein is a count where a higher score means less similar. To normalize the scores to a 0-1 value where 1 = identical, I divided the distance by the length of the longer full name, as suggested above, and subtracted the result from 1. That last step makes the figures directly comparable to a `similarity()` score. (Otherwise, you've got two scales where 1 means opposite things.)
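In SQL terms, the normalization described above looks roughly like this (`levenshtein()` lives in the fuzzystrmatch extension; the cast avoids integer division):

```sql
-- levenshtein() comes from the fuzzystrmatch extension.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT
    levenshtein('David Adams', 'Adam David')            AS lev_distance,
    round(1 - levenshtein('David Adams', 'Adam David')::numeric
              / greatest(length('David Adams'), length('Adam David')), 2)
                                                        AS lev_score,   -- 0-1, 1 = identical
    similarity('David Adams', 'Adam David')             AS trgm_score;  -- 0-1, 1 = identical
```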
Here are some simple results, rounded a bit for clarity:
| Row 1 | Row 2 | levenshtein() | Levenshtein % | Similarity % |
|---|---|---|---|---|
| David Adams | Adam David | 10 | 9 | 77 |
| Adam David | Adams David | 1 | 91 | 77 |
| Adams David | David Adams | 10 | 9 | 100 |
As you can see, the `similarity()` score performs better in a lot of cases, even with this simple example. Then again, Levenshtein does better in one case. It's not rare to combine techniques; if you do that, normalize the scales to the same range to save yourself some headache.
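If you do blend them, one possibility (just one way to do it, not a recipe) is to take the better of the two normalized scores, so a record survives either a swapped name or a small typo:

```sql
-- "Take the greatest of the two scores" is just one possible blend,
-- chosen here for illustration.
SELECT greatest(
           similarity('David Adams', 'Adam David'),
           1 - levenshtein('David Adams', 'Adam David')::numeric
               / greatest(length('David Adams'), length('Adam David'))
       ) AS combined_score;
```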
But all of this is made a lot easier if you've got cleaner data to start with. If one of your problems is with inconsistent abbreviations and punctuation, Levenshtein can be a poor match. For this reason, it's helpful to perform address standardization before duplicate matching.
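As a toy illustration of the kind of pre-cleaning I mean (a couple of regexp_replace rules stand in for a real address standardizer here):

```sql
-- Toy normalization: lower-case, trim, and expand a couple of common
-- abbreviations. A real address standardizer handles far more than this.
SELECT regexp_replace(
           regexp_replace(lower(trim('  123 N Main St  ')), '\mst\M', 'street'),
           '\mn\M', 'north'
       ) AS cleaned_address;   -- => '123 north main street'
```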
For what it's worth (a lot), trigrams in Postgres can use indexes. It's often a good bet to find a technique that safely reduces the candidate set with an indexed search before performing a more expensive comparison with something like Levenshtein. Ah, and a trick for Levenshtein: if you have a target tolerance and store the length of your strings, you can exclude strings that are too short or too long based on that stored length, without ever running the more expensive fuzzy comparison. If, for example, your starting string is 10 characters long and you only want matches at most 2 edits away, you're wasting your time testing strings that are only 7 characters long (or 13 and longer), since the length difference alone already exceeds the tolerance.
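Putting those two ideas together, a sketch might look like this (table, column, and threshold choices are all assumptions; the generated column needs Postgres 12+, and the trigram pre-filter is only "safe" if its threshold is loose enough for your edit-distance tolerance):

```sql
-- Hypothetical table with the string length stored alongside the name.
CREATE TABLE people (
    id        bigint PRIMARY KEY,
    full_name text NOT NULL,
    name_len  int GENERATED ALWAYS AS (length(full_name)) STORED
);

-- Trigram index so the % operator can do the cheap first pass.
CREATE INDEX people_full_name_trgm
    ON people USING gin (full_name gin_trgm_ops);

-- Stage 1: indexed trigram pre-filter (% uses pg_trgm.similarity_threshold,
--          0.3 by default).
-- Stage 2: cheap length bound. With a tolerance of 2 edits, candidates
--          must be within 2 characters of the target's length.
-- Stage 3: the expensive levenshtein() runs only on what's left.
SELECT id, full_name
FROM people
WHERE full_name % 'David Adams'
  AND name_len BETWEEN length('David Adams') - 2 AND length('David Adams') + 2
  AND levenshtein(full_name, 'David Adams') <= 2;
```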
Note that the bad data input problem you describe often comes down to
- poor user training and/or
- poor UX
It's worth reviewing how bad data is getting in, once you've got your cleanup in good order. If you have a finite set of trainable users, it can help to run a nightly (etc.) scan to detect new likely duplicates, and then go talk to whoever is generating them. Maybe there's something they don't know that you can tell them; maybe there's a problem in the UI that you don't know about and they can tell you.
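For that nightly scan, the rough shape might be something like this (every table and column name here is an assumption; `created_at` in particular is just standing in for however you track new records):

```sql
-- Pair up records added in the last day with any existing record whose
-- name looks suspiciously similar, worst offenders first.
SELECT b.id                                 AS new_id,
       a.id                                 AS existing_id,
       b.full_name                          AS new_name,
       a.full_name                          AS existing_name,
       similarity(a.full_name, b.full_name) AS score
FROM people b
JOIN people a
  ON a.id <> b.id
 AND a.full_name % b.full_name              -- indexed trigram match
WHERE b.created_at >= now() - interval '1 day'
ORDER BY score DESC;
```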