
Guys, I have a dataset like this:

```python
import pandas as pd

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly', 'merry',
                        'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
```

It gives this output:

```
        Name
0       John
1  gal britt
2       mona
3      diana
4      molly
5      merry
6       mony
7      molla
8  johnathon
9       dina
```
So, to pair every name against every other name and then measure the similarity, my idea was to use `df.merge(df, how="cross")`.
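
On the toy frame above, that intended approach looks roughly like the sketch below (token_set_ratio from fuzzywuzzy is the measure I have in mind, see the comments; the `score` column name is just for illustration):

```python
# Rough sketch of the cross-merge idea on the small example above.
from fuzzywuzzy import fuzz

pairs = df.merge(df, how="cross")   # every name paired with every name
pairs["score"] = pairs.apply(
    lambda row: fuzz.token_set_ratio(row["Name_x"], row["Name_y"]), axis=1
)
print(pairs.sort_values("score", ascending=False).head(10))
```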

The thing is, the real data has 40,000 rows, and a cross join on that produces roughly 40,000 × 40,000 = 1.6 billion pairs, which I don't have the memory for. Any algorithm or idea would really help, and I'll adjust the logic to my purposes.

I tried working with vaex instead of pandas to handle this amount of data, but I still run into insufficient memory. In short: I KNOW that this algorithm, or this way of thinking about the problem, is wrong and inefficient.

  • What is your desired output? Please can you be more specific about how you're measuring similarity? – Bashton Dec 28 '22 at 14:13
  • It's just fuzzywuzzy's token_set_ratio. The first name, for example 'John', will be paired with every other name, and then the fuzzy logic will be applied to each pair of columns. – Ismail Awad Dec 28 '22 at 14:13
  • The thing is my algorithm is inefficient, and any other solution/adjustment would be appreciated. – Ismail Awad Dec 28 '22 at 14:16
  • You can use pyspark. This will process the data in chunks. Disadvantage: you won't be able to use Python libraries without compromising the speed. Although you can first filter the dataset using simple similarity logic like len(common_words)/len(union_words). – user5828964 Dec 28 '22 at 14:26
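
For reference, a rough sketch of the cheap common/union prefilter idea from the last comment. Since most of these names are single words, this version (my adaptation, not the commenter's exact proposal) applies the same intersection-over-union idea to character trigrams rather than whole words, and the 0.2 threshold is purely illustrative:

```python
# Cheap Jaccard-style prefilter: only pairs that clear this threshold
# would be passed on to the heavier token_set_ratio scoring.
def trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

names = df["Name"].tolist()
grams = [trigrams(n) for n in names]

candidates = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))       # each unordered pair once
    if jaccard(grams[i], grams[j]) >= 0.2   # illustrative threshold
]
print(candidates)
```

On 40,000 names this is still an O(n²) loop, so in practice it would be run in chunks (or in something like pyspark, as the comment suggests), but the cheap check means far fewer pairs ever reach the expensive fuzzy scoring.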

0 Answers