I have a list of strings that I am trying to match against the values in a column. If the best match scores low (below 95), I want to keep the current column value; if it scores 95 or above, I want to return the best fuzzy match from the list. I am trying to put all returned values into a new column. I keep getting the error "tuple index out of range". I think this may be because it wants to return a tuple with the score and the name, but I only want the name. Here is my current code:

    from fuzzywuzzy import process
    from fuzzywuzzy import fuzz


    L = [ducks, frogs, doggies]

    df

    FOO    PETS
    a     duckz
    b     frags
    c     doggies

    def fuzz_m(column, pet_list, score_t):
        for c in column:
            new_name, score = process.extractOne(c, pet_list, score_t)
            if score<95:
                return c
            else:
                return new_name

    df['NEW_PETS'] = fuzz_m(df,L, fuzz.ratio)

Desired output:

    FOO    PETS      NEW_PETS
    a     duckz       ducks
    b     frags       frogs
    c     doggies     doggies
EEPBAH

1 Answer

Several corrections:

  • Pass the column to the function rather than the whole DataFrame (iterating over a DataFrame yields its column labels, not the values). Change

    df['NEW_PETS'] = fuzz_m(df,L, fuzz.ratio)

to

    df['NEW_PETS'] = fuzz_m(df['PETS'], L, fuzz.ratio)

  • Make your list elements strings.

  • Fuzzywuzzy's extractOne method accepts both a processor and a scorer, in that order (see the source code). Your positional argument of fuzz.ratio is mistakenly interpreted as a processor, when it's really a scorer; this is likely where the "tuple index out of range" error comes from. Change process.extractOne(c, pet_list, score_t) to process.extractOne(c, pet_list, scorer=score_t).

  • This loop-based code will not work as expected. fuzz_m is only called once, and because both return statements sit inside the for loop, it exits on the first element. That single return value will then be broadcast into all entries of the series df['NEW_PETS'].
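To illustrate the last point without needing pandas or fuzzywuzzy installed, here is a minimal sketch of the early-return problem. The helpers `ratio` and `best_match` are hypothetical stand-ins built on the standard library's difflib, not the fuzzywuzzy API, and `fuzz_first_only` and `fuzz_all` are illustrative names only.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Stand-in for fuzz.ratio (an assumption): similarity on a 0-100 scale.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_match(value, choices):
    # Stand-in for process.extractOne: returns (best_choice, score).
    return max(((c, ratio(value, c)) for c in choices), key=lambda pair: pair[1])

def fuzz_first_only(values, choices):
    # Same shape as the question's loop: `return` exits on the first value.
    for v in values:
        name, score = best_match(v, choices)
        if score < 95:
            return v
        else:
            return name

def fuzz_all(values, choices):
    # Collects one result per value and returns only after the loop ends.
    results = []
    for v in values:
        name, score = best_match(v, choices)
        results.append(v if score < 95 else name)
    return results

pets = ['duckz', 'frags', 'doggies']
choices = ['ducks', 'frogs', 'doggies']

print(fuzz_first_only(pets, choices))  # one string, for the first pet only
print(fuzz_all(pets, choices))         # a list with one entry per pet
```

Assigning the single string from the first function to a column repeats it in every row, whereas the list from the second lines up with the rows one to one.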

A more pandas-friendly way:

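    from fuzzywuzzy import fuzz, process

    L = ['ducks', 'frogs', 'doggies']

    def fuzz_m(col, pet_list, score_t):
        # col is a single value from the column, not the whole Series
        new_name, score = process.extractOne(col, pet_list, scorer=score_t)
        if score<95:
            return col  # weak match: keep the original value
        else:
            return new_name  # strong match: use the name from the list

    # apply calls fuzz_m once per row and passes the keyword arguments through
    df['NEW_PETS'] = df['PETS'].apply(fuzz_m, pet_list=L, score_t=fuzz.ratio)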
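One thing worth checking is whether a cutoff of 95 is loose enough for the sample data. The sketch below is a rough, dependency-free way to inspect scores; it uses difflib as a stand-in scorer (an assumption, so the numbers from fuzz.ratio should be verified separately), and `ratio` and `match_one` are hypothetical helper names. With this stand-in, 'duckz' against 'ducks' and 'frags' against 'frogs' both score 80, so a cutoff of 95 leaves them unchanged.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Stand-in scorer (assumption): similarity on a 0-100 scale.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def match_one(value, choices, threshold):
    # Per-value function, the same shape as one passed to Series.apply.
    best = max(choices, key=lambda c: ratio(value, c))
    return best if ratio(value, best) >= threshold else value

choices = ['ducks', 'frogs', 'doggies']
pets = ['duckz', 'frags', 'doggies']

# Print the best score for each value to help pick a cutoff.
for p in pets:
    best = max(choices, key=lambda c: ratio(p, c))
    print(p, best, ratio(p, best))

strict = [match_one(p, choices, 95) for p in pets]  # cutoff from the question
loose = [match_one(p, choices, 80) for p in pets]   # a lower, illustrative cutoff
print(strict)
print(loose)
```

If the real scorer gives similar values, the threshold would need to be lowered (or made a parameter) to get the desired output shown in the question.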
Peter Leimbigler