I have a dataset like this:

```python
import pandas as pd

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly',
                        'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
```

It gives this output:
```
        Name
0       John
1  gal britt
2       mona
3      diana
4      molly
5      merry
6       mony
7      molla
8  johnathon
9       dina
```
To compare every name against every other name and detect similarity, I imagined using `df.merge(df, how="cross")`, as in the sketch below.
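On the toy frame above, the approach I have in mind looks roughly like this; the `SequenceMatcher` ratio is just a placeholder similarity measure, not necessarily the metric I will end up using:

```python
import pandas as pd
from difflib import SequenceMatcher  # placeholder similarity measure

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly',
                        'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])

# Cross join: every name paired with every other name (n * n rows)
pairs = df.merge(df, how='cross', suffixes=('_a', '_b'))

# Score each pair; any string-similarity function could go here
pairs['similarity'] = [
    SequenceMatcher(None, a, b).ratio()
    for a, b in zip(pairs['Name_a'], pairs['Name_b'])
]

print(pairs.sort_values('similarity', ascending=False).head(15))
```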
The problem is that the real data has 40,000 rows, and a cross join on that produces far more pairs than I have memory for (rough estimate below). Any algorithm or idea would really help, and I'll adjust the logic to my purposes.
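A back-of-the-envelope estimate of why this blows up (the per-pair byte count is a rough assumption, not a measurement):

```python
n = 40_000                                # rows in the real data
pairs = n * n                             # size of the cross join
print(f"{pairs:,} pairs")                 # 1,600,000,000 pairs

# assuming very roughly ~50 bytes per pair of short name strings
approx_bytes = pairs * 50
print(f"~{approx_bytes / 1e9:.0f} GB")    # ~80 GB, far beyond my RAM
```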
I tried working with vaex instead of pandas to handle this amount of data, but I still ran into insufficient-memory errors.
In short: I KNOW that this algorithm, or this way of thinking about such a problem, is wrong and inefficient.