
I have 50,000 phrases like:

  • add
  • to add
  • chicken
  • a chicken
  • eat the chicken
  • to eat
  • ...

I want to drop every line that has a high fuzzy similarity with another line.

Then the output should be:

  • add
  • to eat
  • chicken
  • ...

I can't compute every pairwise fuzzy match (50,000² comparisons is far too many), so I'm searching for a method like a KD-Tree / Ball-Tree, but one that works with a string distance (Levenshtein distance, fuzzy ratio, ...).

I'd prefer to use only Python, but I'm open-minded! Thank you very much :)
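For reference, the quadratic baseline I'm trying to avoid looks like this; it uses only the standard library's `difflib`, and the 0.6 threshold is an arbitrary illustrative choice:

```python
from difflib import SequenceMatcher

def dedupe_naive(phrases, threshold=0.6):
    """Quadratic baseline: keep a phrase only if it is not too
    similar to any phrase already kept. O(n^2) comparisons."""
    kept = []
    for p in phrases:
        if all(SequenceMatcher(None, p, k).ratio() < threshold for k in kept):
            kept.append(p)
    return kept

phrases = ["add", "to add", "chicken", "a chicken", "eat the chicken", "to eat"]
print(dedupe_naive(phrases))  # → ['add', 'chicken', 'to eat']
```

With 50,000 phrases this is ~1.25 billion `ratio()` calls, which is exactly what makes the approach infeasible.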

Arnaud Hureaux
  • You need to specify the method more. Why keep `chicken` vs `eat the chickens`, for example? In overview, you can create a set of all words other than articles. Use that set as keys in a dict pointing to a list of phrases containing that word. Sort the lists on length. Keep the shortest. – dawg Jan 16 '22 at 18:27
  • **"I allready specify, i said 'drop the line which have a high fuzzy similarity'"** But drop which one? The shortest? The longest? The first or the last? It is a decent question, but you need to be more specific: use a subset of actual data with the actual results you want to see. The example is bad. – dawg Jan 16 '22 at 18:40

1 Answer


After searching through other topics, I didn't find a fast solution using a simple Python library.

But I came up with two ideas to solve the problem:

1. Vectorize each string, for example "a chicken" -> (0,0,0,0,...,1,0,...,1,0,0), and use a KD-Tree/Ball-Tree with sklearn (see my other question on how to implement it: Find nearest point in other dataframe (WITH A LOT OF DATA))

2. Stem every string and apply `.drop_duplicates` with pandas ;)
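Idea 1 could be sketched like this, assuming scikit-learn is available; the `char_wb` n-gram range, the Euclidean metric, and the radius `r=2.0` are illustrative choices, not tuned values (and note the dense conversion, since `BallTree` doesn't accept sparse input):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import BallTree

phrases = ["add", "to add", "chicken", "a chicken", "eat the chicken", "to eat"]

# Vectorize each string as binary character n-gram indicators.
vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3), binary=True)
X = vec.fit_transform(phrases).toarray()  # BallTree needs a dense array

# Query all near-neighbours of every phrase within a fixed radius,
# then keep only the first phrase of each near-duplicate group.
tree = BallTree(X, metric="euclidean")
ind = tree.query_radius(X, r=2.0)

kept, seen = [], set()
for i, neighbours in enumerate(ind):
    if i in seen:
        continue
    kept.append(phrases[i])
    seen.update(neighbours.tolist())
print(kept)
```

The tree query replaces the all-pairs comparison, so the expensive step is closer to O(n log n) than O(n²), at the cost of vector distance only approximating string similarity.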
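And a minimal sketch of idea 2 with pandas; the stop-word list and the trailing-"s" rule below are crude stand-ins for a real stemmer (e.g. NLTK's SnowballStemmer), so treat them as illustrative only:

```python
import pandas as pd

STOPWORDS = {"a", "an", "the", "to"}  # hypothetical stop-word list

def normalise(phrase):
    """Crude stand-in for stemming: drop stop-words and a trailing 's'."""
    words = [w[:-1] if w.endswith("s") else w
             for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(words)

df = pd.DataFrame({"phrase": ["add", "to add", "chicken", "a chicken",
                              "eat the chicken", "to eat"]})
df["key"] = df["phrase"].map(normalise)
deduped = df.drop_duplicates(subset="key")["phrase"].tolist()
print(deduped)  # → ['add', 'chicken', 'eat the chicken', 'to eat']
```

`drop_duplicates` keeps the first occurrence of each normalised key, so the order of the input decides which variant survives.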
