
In a Pandas dataframe, I need to remove entries that are too close with respect to Levenshtein distance. An inefficient implementation is:

import Levenshtein

for index, row in df.iterrows():
    text1 = row['text']
    for index2, row2 in df.iterrows():
        text2 = row2['text']
        lev_ratio = Levenshtein.ratio(text1, text2)
        if index != index2 and lev_ratio > 0.9:
            df.drop(index2, inplace=True)

Is there a more efficient way?

  • Reminds me of [this question](https://stackoverflow.com/questions/48174398/new-dataframe-column-as-a-generic-function-of-other-rows-pandas). – pault May 01 '18 at 20:56
  • I put 0.9 for the example but I need to remove texts that are nearly exactly identical, so I guess in my case I will have: if a is close to b and b is close to c then a is close to c. (Answer to a comment that has been removed ?) –  May 01 '18 at 21:04
  • Your `drop(..., inplace=True)` in the middle of iterating seems very questionable, no? – Ami Tavory May 01 '18 at 21:06
  • Could you elaborate ? –  May 01 '18 at 21:22
  • It might cause the loop to skip words, or do something undefined. – Ami Tavory May 01 '18 at 21:28

1 Answer


To start with a side point, you might want to check if you can inplace-drop while iterating:

for index, row in df.iterrows():
  for index2, row2 in df.iterrows():
     df.drop(index2, inplace = True) # <- is this safe?

To the point of your question: since you're looking only for cases where the Levenshtein ratio is greater than 0.9, there's no need to actually calculate it when it can cheaply be seen to be lower. For example, if one word has length 4 and the other length 8, the ratio cannot exceed 2·4 / (4 + 8) ≈ 0.67, which is below 0.9. Consequently, you can consider something like:

pairs = []
for i, w_i in enumerate(words.values):
    for j, w_j in enumerate(words.values):
        if i >= j:
            continue
        # Length bound: the ratio can't exceed 2 * min_len / (len_i + len_j).
        if 2 * min(len(w_i), len(w_j)) < 0.9 * (len(w_i) + len(w_j)):
            continue
        # Cheap character-set filter before the real computation.
        if len(set(w_i).symmetric_difference(set(w_j))) > 1.8 * min(len(w_i), len(w_j)):
            continue
        # Calculate the ratio only here.
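To make the idea concrete, here is a minimal self-contained sketch of the same pruning, with `difflib.SequenceMatcher.ratio` (stdlib) standing in for `Levenshtein.ratio` — both return 2·M / (len1 + len2), so the same length bound applies. The function name and the decision to collect positions instead of dropping in-place are my own choices for the sketch, not something from your code:

```python
import difflib

def near_duplicate_positions(texts, threshold=0.9):
    """Return positions of entries too close to an earlier entry."""
    drop = set()
    for i, a in enumerate(texts):
        if i in drop:
            continue
        for j in range(i + 1, len(texts)):
            if j in drop:
                continue
            b = texts[j]
            total = len(a) + len(b)
            # The ratio can't exceed 2 * min_len / total; if even that
            # bound is below the threshold, skip the real computation.
            if total == 0 or 2 * min(len(a), len(b)) / total < threshold:
                continue
            if difflib.SequenceMatcher(None, a, b).ratio() > threshold:
                drop.add(j)
    return drop

texts = ["hello world", "hello worlds", "goodbye"]
print(near_duplicate_positions(texts))  # → {1}
```

With a dataframe you would then call `df.drop(df.index[list(positions)])` once, outside the loop, which also sidesteps the drop-while-iterating concern from the comments.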
Ami Tavory
  • Actually I am not certain of the 0.9 value; I was hoping for a more general optimization, something in Pandas that would allow avoiding a quadratic loop? –  May 01 '18 at 21:34
  • @Henry I do not think it's possible to eliminate the quadratic outer loop. Your implementation has a quadratic loop *nested* in a quadratic loop: for each two words, it iterates over each two letters of each word. My only advice (in the answer) is to optimize the inner quadratic loop, by simply not calling it under special cases. – Ami Tavory May 01 '18 at 21:36