I want to remove all dissimilar strings from a dataframe and retain only the "similar" ones.
For example, I have this data:
store_name
------------
Mcdonalds
KFC
Burger King
Mcdonald's
Mcdo
Taco bell
The store we need to compare the others against is the first row, Mcdonalds. With that, we need to remove the other stores and retain all stores similar to the one we are checking.
Here is the expected output:
store_name
------------
Mcdonalds
Mcdonald's
Mcdo
The process then repeats for each row in turn, until it has checked the last one, Taco bell.
To compare string similarity, I am using the fuzzywuzzy library. If comparing two strings gives a similarity ratio of 90 or more, we tag them as similar. But how can I filter the whole dataframe using drop?
Comparing two strings:
ratio = fuzz.token_set_ratio(string_1, string_2)
Filtering the whole dataframe:
# TODO: ERROR on this since we are comparing dataframe, not string.
for index, row in data_df.iterrows():
    copied_data_df = data_df.copy()
    store_name = data_df['store_name']
    copied_data_df.drop(fuzz.token_set_ratio(store_name, copied_data_df) >= 90, inplace=True)
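For reference, here is a minimal sketch of what I am trying to achieve. It scores each store name against one reference string and builds a boolean mask, instead of passing the whole dataframe to the scorer. The helper names simple_ratio and similar_mask are hypothetical. simple_ratio is only a standard-library stand-in based on difflib so the sketch runs without extra packages. It is not fuzzywuzzy's algorithm, and its scores will differ from fuzz.token_set_ratio.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer on a 0-100 scale (case-insensitive).

    Assumption: this only approximates a fuzzy ratio; it is NOT
    fuzzywuzzy's token_set_ratio.
    """
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def similar_mask(names, reference, scorer=simple_ratio, threshold=90):
    """Return one boolean per name: True where the score of that name
    against `reference` is at least `threshold`."""
    return [scorer(reference, name) >= threshold for name in names]


names = ["Mcdonalds", "KFC", "Burger King", "Mcdonald's", "Mcdo", "Taco bell"]

# Compare every store against the first row, one string at a time.
mask = similar_mask(names, names[0])
kept = [name for name, keep in zip(names, mask) if keep]
print(kept)
```

With this stand-in scorer and a threshold of 90, "Mcdo" is not kept, so the threshold or the scorer would probably need adjusting to get my expected output. On the real dataframe, the same mask could presumably be used as data_df[similar_mask(data_df['store_name'], data_df['store_name'].iloc[0], scorer=fuzz.token_set_ratio)], which keeps the matching rows rather than dropping the others.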