I have 50 000 words and short phrases like:
- add
- to add
- chicken
- a chicken
- eat the chicken
- to eat
- ...
I want to drop the lines that have a high fuzzy similarity with other lines.
The output should then be:
- add
- to eat
- chicken
- ...
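To make the expected behaviour concrete, here is a minimal brute-force sketch of what I mean. It is only an illustration under assumptions of mine: the function name `drop_near_duplicates`, the use of `difflib.SequenceMatcher.ratio()` as the similarity measure, and the threshold of 0.6 are placeholders, not a fixed requirement.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(words, threshold=0.6):
    """Keep a word only if it is not too similar to any word already kept.

    Assumption: similarity is difflib's ratio (0.0 to 1.0) and 0.6 is an
    arbitrary cut-off chosen for this example.
    """
    kept = []
    for word in words:
        # Compare against every kept word: quadratic in the worst case.
        if all(SequenceMatcher(None, word, k).ratio() < threshold for k in kept):
            kept.append(word)
    return kept


words = ["add", "to add", "chicken", "a chicken", "eat the chicken", "to eat"]
print(drop_near_duplicates(words))  # ['add', 'chicken', 'to eat']
```

This keeps the same three entries as the output listed above (in input order), but it compares each word with every word kept so far, so it does not scale to the full list.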
I can't compute every pairwise fuzzy match (50 000**2, about 2.5 billion comparisons, is too many), so I am looking for a method like a KD-Tree / Ball-Tree that works with a string distance (Levenshtein distance, fuzzy distance, ...).
I would prefer to use only Python, but I'm open-minded! Thank you very much :)