
I have people's names (first name, last name and surname) in a database column. The data is incomplete and inconsistent. For example, some rows

  • have only a first name, a last name or a surname
  • list the parts in a different order (surname, last name)
  • are spelled incorrectly

I need an algorithm that groups rows likely to refer to the same person, so that I can review each group and manually delete all rows but one.

This data is very specific and names are NOT repeated, so if we have John, Jonh Smihtm and John Smith, they are certainly the same person, and I will manually delete all except the last one.

I need to display them in groups by likelihood: one group where the rows are very likely the same person (John Smith, Jonh Smit), one where they are likely the same person (John, Johnny), and one where they may be the same person (Jo, Jonathan).

I am relatively new to data mining and clustering, so please suggest some algorithms and where to get started.

Has QUIT--Anony-Mousse
user3410843
  • interesting question, but probably a candidate for migration to stats. IMO the key problem here is to find a good model. – cel Feb 01 '15 at 09:27

1 Answer


Do not use clustering. It will produce a lot of false positives. It will consider “Sam” and “Pam” highly similar.

Instead, look at spelling correction, or define a Levenshtein distance threshold. But an approach that models how typos actually happen will work even better than such a naive letter-by-letter approach.
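
As a rough starting point, the threshold idea could be sketched like this in Python. This is only a sketch under assumptions: `candidate_pairs` and the `max_distance` cutoff are hypothetical names and values chosen for illustration, and it uses plain Levenshtein distance, not a typo-aware model.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        previous = current
    return previous[-1]


def candidate_pairs(names, max_distance=2):
    """Return (distance, name_a, name_b) for every pair within the cutoff.

    The cutoff is an assumed value; it has to be tuned on the real data.
    Pairs are sorted so the closest (most likely duplicate) come first.
    """
    pairs = []
    for a, b in combinations(names, 2):
        d = levenshtein(a.lower(), b.lower())
        if d <= max_distance:
            pairs.append((d, a, b))
    return sorted(pairs)


if __name__ == "__main__":
    for d, a, b in candidate_pairs(["John Smith", "Jonh Smit", "Sam", "Pam"], 3):
        print(d, a, b)
```

Sorting by distance gives a crude version of the "very likely / likely / maybe" ordering from the question. Note that "Sam" and "Pam" are still reported at distance 1, which is exactly the false positive a purely letter-based measure cannot avoid.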

Has QUIT--Anony-Mousse
  • But wouldn't a Levenshtein distance run into the same issue? Even with a threshold of 1, Jenny and Penny would be classified as the same. – user3932000 Jul 20 '19 at 04:49
  • Yes, that is why I'd go with a likely-typos approach. And the problem with clustering is that similarity gets applied transitively. So Penny, Jenny, Jonny, Jonn, John, ... all end up in the same cluster, even though the total difference between the first and last is almost the entire string. – Has QUIT--Anony-Mousse Jul 20 '19 at 07:58
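
The chaining effect described in that last comment can be reproduced with a small sketch. It assumes a single-link style grouping (any two names within the threshold are merged, and merges propagate); `single_link_groups` is a hypothetical helper written for this illustration, not a function from any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (ca != cb),
            ))
        previous = current
    return previous[-1]


def single_link_groups(names, max_distance=1):
    """Merge names transitively: a~b and b~c puts a, b and c in one group."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if levenshtein(names[i].lower(), names[j].lower()) <= max_distance:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    chain = ["Penny", "Jenny", "Jonny", "Jonn", "John"]
    print(single_link_groups(chain, max_distance=1))
    print(levenshtein("penny", "john"))
```

Every neighbouring pair in the chain is one edit apart, so all five names land in a single group, although "Penny" and "John" themselves are four edits apart. Showing scored pairs for manual review instead of merged clusters avoids this.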