
I have a list of words:

lst = ['dog', 'cat', 'mate', 'mouse', 'zebra', 'lion']

I also have a pandas dataframe:

import pandas as pd

df = pd.DataFrame({'input': ['dog', 'kat', 'leon', 'moues'], 'suggested_class': ['a', 'a', 'a', 'a']})

input   suggested_class
dog          a
kat          a
leon         a
moues        a

I would like to populate the suggested_class column with the value from lst that is the closest match to the word in the input column, i.e. the one with the highest similarity score (smallest Levenshtein distance). I am using the fuzzywuzzy package to calculate that.

The expected output would be:

input   suggested_class
dog          dog
kat          cat
leon         lion
moues        mouse

I'm aware that one could implement something with the autocorrect package like df.suggested_class = [autocorrect.spell(w) for w in df.input] but this would not work for my situation.

I've tried something like this (using from fuzzywuzzy import fuzz):

for word in lst:
    for n in range(0, len(df.input)):
        if fuzz.ratio(df.input.iloc[n], word) >= 70:
            df.suggested_class.iloc[n] = word
        else:
            df.suggested_class.iloc[n] = "unknown"

which only works for a fixed threshold. I've been able to capture the highest score with:

max([fuzz.ratio(df.input.iloc[0], word) for word in lst])

but am having trouble relating that to a word from lst, and subsequently populating suggested_class with that word.

ShadowRanger
Luxo_Jr

1 Answer


Since you mention fuzzywuzzy

from fuzzywuzzy import process
# extractOne returns a (best_match, score) tuple; keep only the match
df['suggested_class'] = df['input'].apply(lambda x: process.extractOne(x, lst)[0])

df
Out[1365]: 
   input suggested_class
0    dog             dog
1    kat             cat
2   leon            lion
3  moues           mouse
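
If you want to see how the best score gets tied back to its word without going through `process`, here is a minimal sketch of the idea. It is only an illustration under a few assumptions: `similarity` and `best_match` are hypothetical helper names, and the scorer is a standard-library stand-in built on `difflib.SequenceMatcher` rather than fuzzywuzzy itself, so the snippet runs without extra packages. In practice you would likely pass `fuzz.ratio` or `fuzz.token_set_ratio` as `scorer`, and the `cutoff`/`default` arguments are just one way to get the "unknown" fallback from the question.

```python
from difflib import SequenceMatcher

lst = ['dog', 'cat', 'mate', 'mouse', 'zebra', 'lion']
inputs = ['dog', 'kat', 'leon', 'moues']


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (difflib, not fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(word, choices, scorer=similarity, cutoff=0, default="unknown"):
    """Return the choice with the highest score, or `default` below `cutoff`."""
    # max() with key= returns the choice itself rather than its score,
    # which is the link between "highest score" and "which word".
    best = max(choices, key=lambda c: scorer(word, c))
    return best if scorer(word, best) >= cutoff else default


suggested = [best_match(w, lst) for w in inputs]
print(suggested)

# With a DataFrame this would be something like:
# df['suggested_class'] = df['input'].apply(best_match, choices=lst)
```

The key point is `max(..., key=...)`: it compares by score but returns the word. The pluggable `scorer` argument mirrors the `scorer=` parameter that `process.extractOne` accepts, and a nonzero `cutoff` plays a similar role to its `score_cutoff=` parameter.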
BENY
  • This works great, thanks! I had seen the `process` module, but didn't realize you could pass the distance method through `scorer=`, which in my case I need to set to `token_set_ratio`! – Luxo_Jr Apr 05 '18 at 20:05
  • @Luxo_Jr glad I could help :-) – BENY Apr 05 '18 at 20:08