I have a dataframe and a string list:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({'Name': ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
'PORTUGAL', 'PORTUGLA'],
'Column_two': [1,2,3,4,5,6,7,8]
})
print(df)
# Output:
       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3      PARI           4
4    P ARIS           5
5  NOW YORK           6
6  PORTUGAL           7
7  PORTUGLA           8
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']
I am using the fuzzywuzzy Python library. Its fuzz.partial_ratio function returns a score from 0 to 100 that represents how similar the two compared strings are. Example:
fuzz.partial_ratio("BRASIL", "BRAZIL")
# Output:
88
I would like to iterate through the 'Name' column of the dataframe and compare each value to every entry in list_string_correct. If the two are similar enough (a score of 80 or more), I would like to replace the value in the dataframe with the correct name from the list. So, I wrote the following code:
for i in range(len(df)):
    for j in range(len(list_string_correct)):
        var_string = list_string_correct[j]
        # Similarity score between 0 and 100
        result = fuzz.partial_ratio(var_string, df['Name'].iloc[i])
        if result >= 80:  # Condition
            # Assign with a single .loc call to avoid chained assignment
            df.loc[i, 'Name'] = var_string
The code works and produces the desired output:
print(df)
# Output:
       Name  Column_two
0     PARIS           1
1  NEW YORK           2
2     MADRI           3
3     PARIS           4
4     PARIS           5
5  NEW YORK           6
6  PORTUGAL           7
7  PORTUGAL           8
However, I needed two nested for loops to do this. Is there a way to get rid of the explicit loops and keep the same output?
To install the libraries use:
pip install fuzzywuzzy
pip install python-Levenshtein
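On the question of dropping the nested loops, one possible direction is to move the matching into a small helper that is called once per name. The sketch below is not the fuzzywuzzy approach itself: it swaps in the standard library's difflib.get_close_matches as a stand-in for fuzz.partial_ratio so that it runs without extra installs. The helper name correct_name, the plain names list (standing in for the 'Name' column), and the 0.8 cutoff are assumptions made for illustration, and difflib scores are not identical to fuzzywuzzy scores.

```python
from difflib import get_close_matches

# Same values as the 'Name' column in the question, kept as a plain list here.
names = ['PARIS', 'NEW YORK', 'MADRI', 'PARI', 'P ARIS', 'NOW YORK',
         'PORTUGAL', 'PORTUGLA']
list_string_correct = ['PARIS', 'NEW YORK', 'PORTUGAL']


def correct_name(name, choices=list_string_correct, cutoff=0.8):
    """Return the closest entry in choices, or name itself if none reaches cutoff.

    Hypothetical helper: cutoff is a difflib ratio between 0.0 and 1.0,
    not a fuzzywuzzy score between 0 and 100.
    """
    # get_close_matches returns at most n candidates, best match first,
    # keeping only those whose similarity ratio is at least cutoff.
    matches = get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name


# One call per name replaces the inner loop over the correct strings.
corrected = [correct_name(name) for name in names]
print(corrected)
```

With the dataframe from the question, the same helper could presumably be applied as df['Name'] = df['Name'].map(correct_name). If fuzzywuzzy scoring is required, process.extractOne(name, list_string_correct, scorer=fuzz.partial_ratio, score_cutoff=80) plays a similar role inside the helper: it returns a (match, score) tuple, or None when nothing reaches the cutoff. Note that both variants pick the best-scoring candidate, whereas the original loop keeps the last candidate that passes the threshold, so results could differ on other data.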