How to return names "with different spelling" from a dataframe

Question

As you know, a lot of names have multiple spellings.

I have a dataset that has first and last names, but I have an issue with spelling variations.

Here is a sample from the dataset:

     firstName  lastName
0    Ali        Khaled
1    Hamada     5ald
2    3ly        7mada
3    7amada     5aled
4    Sophia     Andrew
5    Sofiya     Jaxon
6    Matthieu   Jackson
7    Matthieu   Jozeph
8    Mathew     Andru

So I am trying to return all people whose first name is "Mathew":
Matthew, Mathew, and Matthieu

Or people whose first or last name is "Hamada":
Hamada, 7amada, 7mada

I have tried replacing these numbers with the corresponding letters and then using the get_close_matches function, but it's neither accurate nor Pythonic.

EDIT:
I think it would be better to replace all spelling variants with the most popular one (in both first and last names). So given {"Matthew": 4, "Mathew": 2, "Matthieu": 1}, replace "Mathew" and "Matthieu" with "Matthew".

Can you [edit] your question to show how you've tried to use `get_close_matches`? (that's presumably also the `difflib.get_close_matches` method?) — Jon Clements, Apr 15 '19 at 23:45
@alihassan do you want to return the matches as a new column or in another way? — Erfan, Apr 15 '19 at 23:58
@Erfan Return Index, So i can count/return first and last names. — Ali H. El-Kassas, Apr 16 '19 at 00:00

score 1 · Answer 1 · answered Apr 16 '19 at 00:03


You can do the following to group close matches and return them as a new column:

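from difflib import get_close_matches as gsm

# For each first name, collect its close matches from the whole firstName column
# (difflib defaults: at most 3 matches, similarity cutoff 0.6).
df['Close_Matches'] = [', '.join(gsm(name, df.firstName)) for name in df.firstName]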

print(df)

  firstName lastName               Close_Matches
0       Ali   Khaled                         Ali
1    Hamada     5ald              Hamada, 7amada
2       3ly    7mada                         3ly
3    7amada    5aled              7amada, Hamada
4    Sophia   Andrew              Sophia, Sofiya
5    Sofiya    Jaxon              Sofiya, Sophia
6  Matthieu  Jackson  Matthieu, Matthieu, Mathew
7  Matthieu   Jozeph  Matthieu, Matthieu, Mathew
8    Mathew    Andru  Mathew, Matthieu, Matthieu
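
One possible way to turn these matches into a replacement map, sketched with the standard library only: count each spelling, visit spellings from most to least frequent, and map every not-yet-assigned close match to the current spelling. The helper name `build_canonical_map` and the greedy strategy are assumptions for illustration, not part of the original answer. Because it works on unique spellings, `get_close_matches` is called at most once per unique spelling rather than once per row.

```python
from collections import Counter
from difflib import get_close_matches

first_names = ['Ali', 'Hamada', '3ly', '7amada', 'Sophia', 'Sofiya',
               'Matthieu', 'Matthieu', 'Mathew']


def build_canonical_map(names, cutoff=0.6):
    """Map each spelling to the most frequent spelling among its close matches.

    Hypothetical helper: a greedy sketch, not a full clustering algorithm.
    """
    counts = Counter(names)
    # Most frequent spellings first; ties keep first-seen order.
    ordered = [name for name, _ in counts.most_common()]
    mapping = {}
    for name in ordered:
        if name in mapping:
            continue  # already assigned to a more frequent spelling
        candidates = [n for n in ordered if n not in mapping]
        # `name` itself is always a candidate, so it maps to itself.
        for match in get_close_matches(name, candidates,
                                       n=len(candidates), cutoff=cutoff):
            mapping[match] = name
    return mapping


mapping = build_canonical_map(first_names)
print(mapping)
```

With a DataFrame, the resulting dict could then be applied with `df['firstName'].map(mapping)`. At the default cutoff `3ly` is still not matched to `Ali`, so the digit spellings would need preprocessing first.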

answered Apr 16 '19 at 00:03

Erfan


It didn't get Ali and 3ly as close matches, so I have to preprocess the data first to get an accurate result.
But I have two issues with this solution:
First: we call `get_close_matches(name, df.firstName)` on every iteration, which is expensive.
Second: I think it would be better to replace all spelling variants with the most popular one (in both first and last names). – Ali H. El-Kassas Apr 16 '19 at 00:20
Unfortunately, we cannot see _what people mean_. You didn't state in your question that you want to replace anything. But I will have a look at it. – Erfan Apr 16 '19 at 00:22

score 1 · Answer 2 · answered Apr 17 '19 at 09:59

To find the similarity between two words or sentences, you may want to use something like edit distance or Jaccard distance.

Let's test it in your case using edit distance:

firstName = ['Ali', 'Hamada', '3ly', '7amada', 'Sophia', 'Sofiya', 'Matthieu', 'Matthieu', 'Mathew']

# No need to implement the distance function; you can call it from NLTK

import nltk

# Find similar first names using edit distance
for name in firstName:
    nameToCompare = [x for x in firstName if x != name]
    for n in nameToCompare:
        print(name, n, nltk.edit_distance(name, n))
    print('***************')

# Ali Hamada 6
# Ali 3ly 2
# Ali 7amada 6
# Ali Sophia 5
# Ali Sofiya 5
# Ali Matthieu 7
# Ali Matthieu 7
# Ali Mathew 6
#***************
# Hamada Ali 6
# Hamada 3ly 6
# Hamada 7amada 1
# Hamada Sophia 5
# Hamada Sofiya 5
# Hamada Matthieu 7
# Hamada Matthieu 7
# Hamada Mathew 5
#***************
# 3ly Ali 2
# 3ly Hamada 6
# 3ly 7amada 6
# 3ly Sophia 6
# 3ly Sofiya 5
# 3ly Matthieu 8
# 3ly Matthieu 8
# 3ly Mathew 6
#***************
# 7amada Ali 6
# 7amada Hamada 1
# 7amada 3ly 6
# 7amada Sophia 5
# 7amada Sofiya 5
# 7amada Matthieu 7
# 7amada Matthieu 7
# 7amada Mathew 5
#***************
# Sophia Ali 5
# Sophia Hamada 5
# Sophia 3ly 6
# Sophia 7amada 5
# Sophia Sofiya 3
# Sophia Matthieu 6
# Sophia Matthieu 6
# Sophia Mathew 5
#***************
# Sofiya Ali 5
# Sofiya Hamada 5
# Sofiya 3ly 5
# Sofiya 7amada 5
# Sofiya Sophia 3
# Sofiya Matthieu 7
# Sofiya Matthieu 7
# Sofiya Mathew 6
#***************
# Matthieu Ali 7
# Matthieu Hamada 7
# Matthieu 3ly 8
# Matthieu 7amada 7
# Matthieu Sophia 6
# Matthieu Sofiya 7
# Matthieu Mathew 3
#***************
# Matthieu Ali 7
# Matthieu Hamada 7
# Matthieu 3ly 8
# Matthieu 7amada 7
# Matthieu Sophia 6
# Matthieu Sofiya 7
# Matthieu Mathew 3
#***************
# Mathew Ali 6
# Mathew Hamada 5
# Mathew 3ly 6
# Mathew 7amada 5
# Mathew Sophia 5
# Mathew Sofiya 6
# Mathew Matthieu 3
# Mathew Matthieu 3
#***************

Smaller numbers mean the names are more similar. You can see that it identifies similar names with different spellings.
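
To turn those distances into actual groups, one option is to keep every name within a fixed distance of the target. The sketch below uses a small pure-Python Levenshtein function as a stand-in for `nltk.edit_distance` so that it runs without NLTK; `similar_names` and the threshold of 3 are assumptions chosen to fit this sample, and a fixed threshold is likely too generous for very short names.

```python
first_names = ['Ali', 'Hamada', '3ly', '7amada', 'Sophia', 'Sofiya',
               'Matthieu', 'Matthieu', 'Mathew']


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similar_names(target, names, max_distance=3):
    """Return the unique names within max_distance edits of target, sorted."""
    return sorted({n for n in names if edit_distance(target, n) <= max_distance})


print(similar_names('Mathew', first_names))
print(similar_names('Hamada', first_names))
```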

Now let's apply Jaccard distance:

for name in firstName:
    nameToCompare = [x for x in firstName if x != name]
    for n in nameToCompare:
        print(name, n, (1-nltk.jaccard_distance(set(name), set(n)))*100)
    print('***************')

# Ali Hamada 0.0
# Ali 3ly 19.999999999999996
# Ali 7amada 0.0
# Ali Sophia 12.5
# Ali Sofiya 12.5
# Ali Matthieu 11.111111111111116
# Ali Matthieu 11.111111111111116
# Ali Mathew 0.0
#***************
# Hamada Ali 0.0
# Hamada 3ly 0.0
# Hamada 7amada 60.0
# Hamada Sophia 11.111111111111116
# Hamada Sofiya 11.111111111111116
# Hamada Matthieu 9.999999999999998
# Hamada Matthieu 9.999999999999998
# Hamada Mathew 11.111111111111116
#***************
# 3ly Ali 19.999999999999996
# 3ly Hamada 0.0
# 3ly 7amada 0.0
# 3ly Sophia 0.0
# 3ly Sofiya 12.5
# 3ly Matthieu 0.0
# 3ly Matthieu 0.0
# 3ly Mathew 0.0
#***************
# 7amada Ali 0.0
# 7amada Hamada 60.0
# 7amada 3ly 0.0
# 7amada Sophia 11.111111111111116
# 7amada Sofiya 11.111111111111116
# 7amada Matthieu 9.999999999999998
# 7amada Matthieu 9.999999999999998
# 7amada Mathew 11.111111111111116
#***************
# Sophia Ali 12.5
# Sophia Hamada 11.111111111111116
# Sophia 3ly 0.0
# Sophia 7amada 11.111111111111116
# Sophia Sofiya 50.0
# Sophia Matthieu 30.000000000000004
# Sophia Matthieu 30.000000000000004
# Sophia Mathew 19.999999999999996
#***************
# Sofiya Ali 12.5
# Sofiya Hamada 11.111111111111116
# Sofiya 3ly 12.5
# Sofiya 7amada 11.111111111111116
# Sofiya Sophia 50.0
# Sofiya Matthieu 18.181818181818176
# Sofiya Matthieu 18.181818181818176
# Sofiya Mathew 9.090909090909093
#***************
# Matthieu Ali 11.111111111111116
# Matthieu Hamada 9.999999999999998
# Matthieu 3ly 0.0
# Matthieu 7amada 9.999999999999998
# Matthieu Sophia 30.000000000000004
# Matthieu Sofiya 18.181818181818176
# Matthieu Mathew 62.5
#***************
# Matthieu Ali 11.111111111111116
# Matthieu Hamada 9.999999999999998
# Matthieu 3ly 0.0
# Matthieu 7amada 9.999999999999998
# Matthieu Sophia 30.000000000000004
# Matthieu Sofiya 18.181818181818176
# Matthieu Mathew 62.5
#***************
# Mathew Ali 0.0
# Mathew Hamada 11.111111111111116
# Mathew 3ly 0.0
# Mathew 7amada 11.111111111111116
# Mathew Sophia 19.999999999999996
# Mathew Sofiya 9.090909090909093
# Mathew Matthieu 62.5
# Mathew Matthieu 62.5
#***************

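The values here are similarity percentages (1 minus the Jaccard distance, times 100), so larger numbers mean more similar names. Again, we get good results!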
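
For reference, the quantity printed above can be written out by hand. This is a minimal sketch of Jaccard similarity over character sets, equivalent in intent to `1 - nltk.jaccard_distance(set(a), set(b))`; the function name `jaccard_similarity` is an assumption. Because it compares sets of characters, it ignores letter order and repetition, which is worth keeping in mind when choosing a threshold.

```python
def jaccard_similarity(a, b):
    """Size of the shared character set divided by the size of the combined set."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty strings: treat as identical
    return len(set_a & set_b) / len(union)


print(jaccard_similarity('Hamada', '7amada'))
print(jaccard_similarity('Mathew', 'Matthieu'))
# Order is ignored: a scrambled spelling has exactly the same character set.
print(jaccard_similarity('Hamada', 'adamaH'))
```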

Hope this helps.

score 0 · Answer 3 · answered Apr 16 '19 at 00:16

The issue is that the concept of "the same name with a different spelling" depends on phonetics. People determine this by listening to the pronunciation of both names and saying "hey, these sound the same." The only way a computer could possibly know that "Matthew" and "Matthieu" are the "same name" would be to run some type of text-to-speech into some audio analysis.

As this is most likely not what you want to do, the only thing you could really look at is an edit distance such as the Levenshtein distance (the Hamming distance only applies to strings of equal length), with some threshold (perhaps 1 character) that you accept as "the same name." get_close_matches() works along similar lines, but it scores similarity as a ratio relative to the combined length of the two words. Even that will produce false positives (there are surely distinct names within an edit distance of 1, even if I can't think of any right now), and you won't correctly group names like "Haley" and "Hayleigh" until you crank that threshold up to 4, at which point you will have quite a lot of false positives.
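
To see what `get_close_matches()` scores, the sketch below calls `difflib.SequenceMatcher` directly. Its `ratio()` is 2*M/T, where M is the number of matched characters and T is the combined length of both strings, and `get_close_matches()` keeps candidates whose ratio is at least `cutoff` (0.6 by default). The wrapper name `similarity` is an assumption for illustration.

```python
from difflib import SequenceMatcher, get_close_matches

names = ['Ali', 'Hamada', '3ly', '7amada', 'Sophia', 'Sofiya',
         'Matthieu', 'Matthieu', 'Mathew']


def similarity(a, b):
    """Similarity ratio in [0, 1] used by difflib to rank close matches."""
    return SequenceMatcher(None, a, b).ratio()


for a, b in [('Mathew', 'Matthieu'), ('Hamada', '7amada'), ('Ali', '3ly')]:
    print(a, b, round(similarity(a, b), 3))

# Same ranking, filtered at the default cutoff of 0.6.
print(get_close_matches('Mathew', names, n=5, cutoff=0.6))
```

Pairs such as "Ali" and "3ly" fall below the default cutoff, and lowering the cutoff far enough to catch them also admits unrelated names.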

Not to mention that names are not required to be pronounced the way they are spelled at all. I can name my son "a" and pronounce it "Jared." How could you possibly detect that this is an alternative spelling of "Jerrod"? You cannot, and therefore you cannot programmatically determine whether two names are "the same." The issue is that the problem itself is not well defined. You could define it better by saying that you would like to group names that are "phonetically the same." That allows you to skip contrived examples like "a", but you've merely traded that problem for the need for some sort of phonetic engine, which is far from trivial.
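
To make "phonetic engine" concrete, here is a sketch of a simplified Soundex key, a classic and very crude English-oriented heuristic that is swapped in purely for illustration and is not something the question or this answer relies on. The function name `soundex` and the decision to drop non-letters are assumptions. It groups some spellings, but it also shows the limits described above: digit spellings, silent letters, and arbitrary pronunciations all defeat it.

```python
# Standard Soundex consonant classes; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in [('BFPV', '1'), ('CGJKQSXZ', '2'), ('DT', '3'),
                       ('L', '4'), ('MN', '5'), ('R', '6')]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character key: first letter plus up to three digit codes."""
    letters = [c for c in name.upper() if c.isalpha()]  # assumption: drop digits
    if not letters:
        return ''
    key = letters[0]
    prev = CODES.get(letters[0], '')
    for ch in letters[1:]:
        code = CODES.get(ch, '')
        if code and code != prev:
            key += code
        if ch not in 'HW':  # H and W do not separate repeated codes
            prev = code
    return (key + '000')[:4]


for name in ['Mathew', 'Matthieu', 'Hamada', '7amada', 'Haley', 'Hayleigh', 'a']:
    print(name, soundex(name))
```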

tl;dr not possible
