
I want to iterate through a Pandas dataframe and get the fuzz.ratio score only for each row pair (not for all combinations). My dataframe looks like this:

   Acct_Owner  Address                     Address2
0  Name1       NaN                         33 Liberty Street
1  Name2       330 N Wabash Ave Ste 39300  330 North Wabash Avenue Suite 39300

There are missing values, so I am using "try:" to skip over missing value rows. Below is the current for loop:

for row in df_high_scores.index:
    k1 = df_high_scores.get_value(row, 'Address')
    k2 = df_high_scores.get_value(row, 'Address2')

    try:
        df_high_scores['Address_Score'] = fuzz.ratio(k1, k2)
    except:
        None

The result is showing the same score for all rows. Hoping to figure out why the loop isn't iterating through and scoring each row. Thanks for reading...

mwhee

1 Answer


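The loop assigns to the whole Address_Score column on every iteration, so each pass overwrites every row and the column ends up holding only the last computed score. Assign to the current row by its index label instead: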

df_high_scores.loc[row, 'Address_Score'] = fuzz.ratio(k1, k2)
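Putting that fix back into the loop from the question might look like the sketch below. A few things here are assumptions rather than part of the original code: the two sample rows are rebuilt by hand, `.at` replaces `get_value` (which was removed in pandas 1.0), an explicit missing-value check replaces the bare `try`/`except`, and a `difflib`-based stand-in scorer is used if `fuzzywuzzy` is not installed.

```python
import difflib

import numpy as np
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in scorer so the sketch runs without fuzzywuzzy.
    # It returns a 0-100 similarity but may not match fuzz.ratio exactly.
    def ratio(a, b):
        return int(round(100 * difflib.SequenceMatcher(None, a, b).ratio()))

# Sample rows rebuilt from the question.
df_high_scores = pd.DataFrame({
    "Acct_Owner": ["Name1", "Name2"],
    "Address": [np.nan, "330 N Wabash Ave Ste 39300"],
    "Address2": ["33 Liberty Street", "330 North Wabash Avenue Suite 39300"],
})

# Create the column up front so rows that get skipped stay NaN.
df_high_scores["Address_Score"] = np.nan

for row in df_high_scores.index:
    k1 = df_high_scores.at[row, "Address"]
    k2 = df_high_scores.at[row, "Address2"]
    if pd.isna(k1) or pd.isna(k2):
        continue  # skip rows with a missing address
    # Write to this row only, not to the whole column.
    df_high_scores.loc[row, "Address_Score"] = ratio(k1, k2)

print(df_high_scores)
```

Checking for missing values explicitly, rather than wrapping the call in a bare `except`, keeps unrelated errors (a misspelled column name, for example) from being silently swallowed.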

A better way to do this instead of iterating rows is:

df_high_scores['Address_Score'] = df_high_scores.apply(lambda x : fuzz.ratio(x.Address, x.Address2), axis=1)
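Because the data contains missing addresses, the lambda above may raise an error or return a misleading score on NaN rows, depending on the fuzzy-matching library and its version. A minimal sketch of a guarded variant follows; `safe_ratio` is a hypothetical helper name, the sample rows are rebuilt from the question, and the `difflib` fallback is only an assumed stand-in for `fuzz.ratio`.

```python
import difflib

import numpy as np
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in scorer so the sketch runs without fuzzywuzzy.
    def ratio(a, b):
        return int(round(100 * difflib.SequenceMatcher(None, a, b).ratio()))


def safe_ratio(row):
    """Hypothetical helper: score one row, or NaN if either address is missing."""
    if pd.isna(row["Address"]) or pd.isna(row["Address2"]):
        return np.nan
    return ratio(row["Address"], row["Address2"])


df_high_scores = pd.DataFrame({
    "Acct_Owner": ["Name1", "Name2"],
    "Address": [np.nan, "330 N Wabash Ave Ste 39300"],
    "Address2": ["33 Liberty Street", "330 North Wabash Avenue Suite 39300"],
})

# axis=1 passes each row to safe_ratio as a Series.
df_high_scores["Address_Score"] = df_high_scores.apply(safe_ratio, axis=1)
print(df_high_scores)
```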

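Note that apply with axis=1 still calls the function once per row in Python, so it can be slow on large DataFrames. Check the documentation of your fuzzy-matching library to see whether it accepts NumPy arrays or pandas Series as inputs.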
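If the scorer only accepts two plain strings, one option that avoids building a Series for every row is a list comprehension over the two columns. This is still a Python-level loop, not a vectorized operation, so it is worth timing on your own data. The sample rows and the `difflib` fallback scorer below are assumptions made so the sketch runs on its own.

```python
import difflib

import numpy as np
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: stand-in scorer so the sketch runs without fuzzywuzzy.
    def ratio(a, b):
        return int(round(100 * difflib.SequenceMatcher(None, a, b).ratio()))

df_high_scores = pd.DataFrame({
    "Acct_Owner": ["Name1", "Name2"],
    "Address": [np.nan, "330 N Wabash Ave Ste 39300"],
    "Address2": ["33 Liberty Street", "330 North Wabash Avenue Suite 39300"],
})

# Walk the two columns in lockstep; anything that is not a string
# (NaN, for instance) gets NaN instead of a score.
df_high_scores["Address_Score"] = [
    ratio(a, b) if isinstance(a, str) and isinstance(b, str) else np.nan
    for a, b in zip(df_high_scores["Address"], df_high_scores["Address2"])
]
print(df_high_scores)
```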

  • Thanks Babu! The former approach works and makes perfect sense. If I run into issues with larger data, I will use the .apply option. – mwhee Nov 22 '17 at 17:24