
I have a dataframe and a string list:

      import pandas as pd
      from fuzzywuzzy import fuzz
      from fuzzywuzzy import process

      df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
                                  'PORTUGAL', 'PORTUGLA'],                   
                         'Column_two': [1,2,3,4,5,6,7,8]                 
                         })

      print(df)

      # Output:

             Name  Column_two
      0     PARIS           1
      1  NEW YORK           2
      2     MADRI           3
      3      PARI           4
      4    P ARIS           5
      5  NOW YORK           6
      6  PORTUGAL           7
      7  PORTUGLA           8

      list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']

I am using the fuzzywuzzy Python library. Its fuzz.partial_ratio function returns a number from 0 to 100 that represents how similar the two compared strings are. Example: fuzz.partial_ratio("BRASIL", "BRAZIL")

     # Output:
     88
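
For intuition, the idea behind a partial ratio can be sketched with the standard library's difflib: slide the shorter string across the longer one and keep the best similarity found. This is only a rough illustration, not fuzzywuzzy's implementation, and its scores may differ from the library's. The helper name partial_similarity is hypothetical.

```python
from difflib import SequenceMatcher


def partial_similarity(a: str, b: str) -> int:
    """Hypothetical helper: best 0-100 similarity between the shorter string
    and any equally long window of the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    # Compare the shorter string with every window of the same length.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# Score a few names from the dataframe against one correct name.
for name in ["PARI", "P ARIS", "MADRI"]:
    print(name, partial_similarity("PARIS", name))
```

Because only the best-aligned window counts, a truncated name such as "PARI" still scores highly against "PARIS", which is why a threshold like 80 catches it.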

I would like to iterate through the 'Name' column of the dataframe and compare each value with the strings in list_string_correct. If a value is similar enough to one of them, I would like to replace it with that correct name. So, I wrote the following code:

      for i in range(len(df)):
          for var_string in list_string_correct:

              # Similarity score between 0 and 100
              result = fuzz.partial_ratio(var_string, df['Name'].iloc[i])

              if result >= 80:  # Similar enough: replace with the correct name
                  # A single .loc call avoids chained assignment
                  df.loc[i, 'Name'] = var_string

The code is working. The output is as desired:

     print(df)

     # Output:

            Name  Column_two
     0     PARIS           1
     1  NEW YORK           2
     2     MADRI           3
     3     PARIS           4
     4     PARIS           5
     5  NEW YORK           6
     6  PORTUGAL           7
     7  PORTUGAL           8

However, I needed two nested for loops. Is there a way to avoid the loops and keep the same output?

To install the libraries use:

      pip install fuzzywuzzy
      pip install python-Levenshtein
Jane Borges

2 Answers


Try process.extractOne from the thefuzz package (the successor of fuzzywuzzy, with the same author and the same API):

# from fuzzywuzzy import process
from thefuzz import process

THRESHOLD = 80

df['Name'] = \
    df['Name'].apply(lambda x: process.extractOne(x, list_string_correct,
                                   score_cutoff=THRESHOLD)).str[0].fillna(df['Name'])

Output:

>>> df
       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3     PARIS           4
4     PARIS           5
5  NEW YORK           6
6  PORTUGAL           7
7  PORTUGAL           8
Corralien

If for some reason you need to use the fuzzywuzzy package (instead of thefuzz as recommended by @Corralien), you can use one loop instead:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
                            'PORTUGAL', 'PORTUGLA'],                   
                    'Column_two': [1,2,3,4,5,6,7,8]                 
                    })

list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']


for correct_name in list_string_correct:
    df['Name'] = df['Name'].apply(lambda x: correct_name if fuzz.partial_ratio(correct_name, x) >= 80 else x)

Output:

       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3     PARIS           4
4     PARIS           5
5  NEW YORK           6
6  PORTUGAL           7
7  PORTUGAL           8
Derek O
`fuzzywuzzy.process` (which you're already importing) lets you match a query against a list of options, so there is no need to loop over the list items at all. – RJ Adriaansen Jan 07 '22 at 20:37
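
Following the same idea of matching each value against the whole list in one call, a possible variation is to resolve every distinct name once and store the result in a dictionary. The sketch below uses difflib.get_close_matches from the standard library as a stand-in for process.extractOne, so it runs without extra packages. Its cutoff is on a 0 to 1 scale and its scores are not the same as partial_ratio, so the threshold of 0.8 is an assumption that happens to suit this sample data. The helper name build_mapping is hypothetical.

```python
from difflib import get_close_matches

names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']


def build_mapping(values, choices, cutoff=0.8):
    """Hypothetical helper: map each distinct value to its closest choice,
    or to itself when no choice reaches the cutoff."""
    mapping = {}
    for value in set(values):
        # n=1 asks for the single best match at or above the cutoff.
        matches = get_close_matches(value, choices, n=1, cutoff=cutoff)
        mapping[value] = matches[0] if matches else value
    return mapping


mapping = build_mapping(names, list_string_correct)
corrected = [mapping[name] for name in names]
print(corrected)
```

With a dataframe, the dictionary could be built from df['Name'].unique() and applied with df['Name'] = df['Name'].map(mapping). Since every distinct name has an entry, unmatched values such as 'MADRI' are left unchanged.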