
Guys, I have a dataset like this:

```python
import pandas as pd

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly', 'merry',
                        'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
```

It gives this output:

```
        Name
0       John
1  gal britt
2       mona
3      diana
4      molly
5      merry
6       mony
7      molla
8  johnathon
9       dina
```
So, to pair every name against every other name and then measure the similarity, my idea was to use `df.merge(df, how="cross")`.
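
On the toy frame above, that intended approach looks roughly like the sketch below (token_set_ratio from fuzzywuzzy is the measure I have in mind, see the comments; the `score` column name is just for illustration):

```python
# Rough sketch of the cross-merge idea on the small example above.
from fuzzywuzzy import fuzz

pairs = df.merge(df, how="cross")   # every name paired with every name
pairs["score"] = pairs.apply(
    lambda row: fuzz.token_set_ratio(row["Name_x"], row["Name_y"]), axis=1
)
print(pairs.sort_values("score", ascending=False).head(10))
```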

The thing is, the real data has 40,000 rows, and a cross join on that produces roughly 40,000 × 40,000 = 1.6 billion pairs, which I don't have the memory for. Any algorithm or idea would really help, and I'll adjust the logic to my purposes.

I tried working with vaex instead of pandas to handle this amount of data, but I still run into insufficient memory. In short: I KNOW that this algorithm, or this way of thinking about the problem, is wrong and inefficient.

  • What is your desired output? Please can you be more specific about how you're measuring similarity? – Bashton Dec 28 '22 at 14:13
  • It's just fuzzywuzzy's token_set_ratio. The first name, for example 'John', will be paired with every other name, and then the fuzzy logic will be applied to each pair of columns. – Ismail Awad Dec 28 '22 at 14:13
  • The thing is my algorithm is inefficient, and any other solution/adjustment would be appreciated. – Ismail Awad Dec 28 '22 at 14:16
  • You can use pyspark. This will process the data in chunks. Disadvantage: you won't be able to use Python libraries without compromising the speed. Although you can first filter the dataset using simple similarity logic like len(common_words)/len(union_words). – user5828964 Dec 28 '22 at 14:26
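
For reference, a rough sketch of the cheap common/union prefilter idea from the last comment. Since most of these names are single words, this version (my adaptation, not the commenter's exact proposal) applies the same intersection-over-union idea to character trigrams rather than whole words, and the 0.2 threshold is purely illustrative:

```python
# Cheap Jaccard-style prefilter: only pairs that clear this threshold
# would be passed on to the heavier token_set_ratio scoring.
def trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

names = df["Name"].tolist()
grams = [trigrams(n) for n in names]

candidates = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))       # each unordered pair once
    if jaccard(grams[i], grams[j]) >= 0.2   # illustrative threshold
]
print(candidates)
```

On 40,000 names this is still an O(n²) loop, so in practice it would be run in chunks (or in something like pyspark, as the comment suggests), but the cheap check means far fewer pairs ever reach the expensive fuzzy scoring.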

0 Answers