Remove all dissimilar strings from a dataframe using fuzzywuzzy

Question

I want to remove all dissimilar strings from a dataframe and keep only the "similar" ones.

For example, I have this data:

store_name
------------
Mcdonalds
KFC
Burger King
Mcdonald's
Mcdo
Taco bell

The store to compare against is the first row, Mcdonalds. We need to remove the other stores and keep every store whose name is similar to it.

Here is the expected output:

store_name
------------
Mcdonalds
Mcdonald's
Mcdo

The process then repeats for each row in turn, until Taco bell has been checked.

To compare string similarity, I am using the fuzzywuzzy library. If two strings have a similarity ratio of 90 or more, we tag them as similar. But how can I filter the whole dataframe using drop?

Comparing two strings:

ratio = fuzz.token_set_ratio(string_1, string_2)

Filtering the whole dataframe:

    # TODO: ERROR on this since we are comparing dataframe, not string.
    for index, row in data_df.iterrows():
        copied_data_df = data_df.copy()
        store_name = data_df['store_name']
        copied_data_df.drop(fuzz.token_set_ratio(store_name, copied_data_df) >= 90, inplace=True)

score 0 · Answer 1 · answered Dec 24 '21 at 08:38

So, with the following dataframe:

import pandas as pd

from fuzzywuzzy import fuzz

df = pd.DataFrame(
    {
        "store_name": [
            "Mcdonalds",
            "KFC",
            "Burger King",
            "Mcdonald's",
            "Mcdo",
            "Taco bell",
        ]
    }
)

You can do this:

# Score every store name against the first row's value.
# fuzz.partial_ratio is used here because fuzz.ratio("Mcdo", "Mcdonalds")
# is only 62, so a plain ratio with this threshold would drop "Mcdo".
reference = df.loc[0, "store_name"]
scores = df["store_name"].map(lambda name: fuzz.partial_ratio(name, reference))

# Keep only the rows scoring above the threshold and renumber the index
df = df[scores > 80].reset_index(drop=True)

Which outputs:

print(df)

   store_name
0   Mcdonalds
1  Mcdonald's
2        Mcdo
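
The question also mentions repeating the check for every row, not only the first one. A minimal sketch of that loop follows, under some assumptions: `window_ratio` and `similar_groups` are hypothetical helper names, and `window_ratio` is a brute-force, case-insensitive stand-in built on the standard library's `difflib`, not fuzzywuzzy's own algorithm. Any fuzzywuzzy scorer that returns a 0-100 score, such as `fuzz.partial_ratio`, could be passed as `scorer` instead, although the scores may then differ.

```python
from difflib import SequenceMatcher


def window_ratio(a, b):
    """Stand-in scorer (0-100): best match of the shorter string against
    every same-length window of the longer one, ignoring case."""
    shorter, longer = sorted((a.lower(), b.lower()), key=len)
    if not shorter:
        return 0
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + len(shorter)]).ratio()
        for i in range(len(longer) - len(shorter) + 1)
    )
    return round(100 * best)


def similar_groups(names, scorer=window_ratio, threshold=80):
    """Map each name to the names that score above the threshold against it."""
    names = list(names)
    return {
        ref: [name for name in names if scorer(name, ref) > threshold]
        for ref in names
    }


stores = ["Mcdonalds", "KFC", "Burger King", "Mcdonald's", "Mcdo", "Taco bell"]
groups = similar_groups(stores)
for ref, matches in groups.items():
    print(ref, "->", matches)
```

Because `similar_groups` accepts any iterable of strings, `df["store_name"]` can be passed directly, and the rows for one reference store can then be selected with `df[df["store_name"].isin(groups["Mcdonalds"])]`.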
