
I have 50,000 phrases like:

  • add
  • to add
  • chicken
  • a chicken
  • eat the chicken
  • to eat
  • ...

I want to drop every line that has a high fuzzy similarity with another line.

Then the output should be:

  • add
  • to eat
  • chicken
  • ...

I can't compute every pairwise fuzzy match (50,000² comparisons is far too many), so I'm searching for a method like a KD-Tree / Ball-Tree, but one that works with a string distance (Levenshtein distance, fuzzy ratio, ...).

I'd prefer to use only Python, but I'm open-minded! Thank you very much :)
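For reference, the quadratic baseline I'm trying to avoid looks like this; it uses only the standard library's `difflib`, and the 0.6 threshold is an arbitrary illustrative choice:

```python
from difflib import SequenceMatcher

def dedupe_naive(phrases, threshold=0.6):
    """Quadratic baseline: keep a phrase only if it is not too
    similar to any phrase already kept. O(n^2) comparisons."""
    kept = []
    for p in phrases:
        if all(SequenceMatcher(None, p, k).ratio() < threshold for k in kept):
            kept.append(p)
    return kept

phrases = ["add", "to add", "chicken", "a chicken", "eat the chicken", "to eat"]
print(dedupe_naive(phrases))  # → ['add', 'chicken', 'to eat']
```

With 50,000 phrases this is ~1.25 billion `ratio()` calls, which is exactly what makes the approach infeasible.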

Arnaud Hureaux
  • You need to specify the method more. Why keep `chicken` vs `eat the chickens`, for example? In overview, you can create a set of all words other than articles. Use that set as keys in a dict pointing to a list of phrases containing that word. Sort the lists on length. Keep the shortest. – dawg Jan 16 '22 at 18:27
  • **"I allready specify, i said 'drop the line which have a high fuzzy similarity'"** But drop which one? The shortest? The longest? The first or the last? It is a decent question, but you need to be more specific: use a subset of actual data with the actual results you want to see. The example is bad. – dawg Jan 16 '22 at 18:40

1 Answer


After searching through other topics, I didn't find a fast solution using a simple Python library.

But I came up with two ideas to solve the problem:

1. Vectorize each string, for example "a chicken" -> (0,0,0,0,...,1,0,...,1,0,0), and use a KD-Tree/Ball-Tree with sklearn (see my other question on how to implement it: Find nearest point in other dataframe (WITH A LOT OF DATA))

2. Stem every string and apply `.drop_duplicates` with pandas ;)
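Idea 1 could be sketched like this, assuming scikit-learn is available; the `char_wb` n-gram range, the Euclidean metric, and the radius `r=2.0` are illustrative choices, not tuned values (and note the dense conversion, since `BallTree` doesn't accept sparse input):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import BallTree

phrases = ["add", "to add", "chicken", "a chicken", "eat the chicken", "to eat"]

# Vectorize each string as binary character n-gram indicators.
vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3), binary=True)
X = vec.fit_transform(phrases).toarray()  # BallTree needs a dense array

# Query all near-neighbours of every phrase within a fixed radius,
# then keep only the first phrase of each near-duplicate group.
tree = BallTree(X, metric="euclidean")
ind = tree.query_radius(X, r=2.0)

kept, seen = [], set()
for i, neighbours in enumerate(ind):
    if i in seen:
        continue
    kept.append(phrases[i])
    seen.update(neighbours.tolist())
print(kept)
```

The tree query replaces the all-pairs comparison, so the expensive step is closer to O(n log n) than O(n²), at the cost of vector distance only approximating string similarity.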
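And a minimal sketch of idea 2 with pandas; the stop-word list and the trailing-"s" rule below are crude stand-ins for a real stemmer (e.g. NLTK's SnowballStemmer), so treat them as illustrative only:

```python
import pandas as pd

STOPWORDS = {"a", "an", "the", "to"}  # hypothetical stop-word list

def normalise(phrase):
    """Crude stand-in for stemming: drop stop-words and a trailing 's'."""
    words = [w[:-1] if w.endswith("s") else w
             for w in phrase.lower().split() if w not in STOPWORDS]
    return " ".join(words)

df = pd.DataFrame({"phrase": ["add", "to add", "chicken", "a chicken",
                              "eat the chicken", "to eat"]})
df["key"] = df["phrase"].map(normalise)
deduped = df.drop_duplicates(subset="key")["phrase"].tolist()
print(deduped)  # → ['add', 'chicken', 'eat the chicken', 'to eat']
```

`drop_duplicates` keeps the first occurrence of each normalised key, so the order of the input decides which variant survives.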
