Fuzzy matching column with right names of a list

Question

I have dataframe column with typos.

ID	Banknane
1	Bank of America
2	bnk of America
3	Jp Morg
4	Jp Morgan

And I have a list with the right names of the banks.

["Bank of America", "JPMorgan Chase]

I want to check and replace wrong banknames with the right names of the list with the help of levenshtein distance.

Please provide enough code so others can better understand or reproduce the problem. — Community, Jun 24 '22 at 20:09

score 1 · Answer 1 · answered Jun 25 '22 at 14:47

Here is one simple way to do it using Python standard library difflib module, which provides helpers for computing deltas.

from difflib import SequenceMatcher

# Define a helper function
def match(x, values, threshold):
    def ratio(a, b):
        return SequenceMatcher(None, a, b).ratio()

    results = {
        value: ratio(value, x) for value in values if ratio(value, x) > threshold
    }
    return max(results, key=results.get) if results else x

And then:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Bankname": ["Bank of America", "bnk of America", "Jp Morg", "Jp Morgan"],
    }
)

names = ["Bank of America", "JPMorgan Chase"]

df["Bankname"] = df["Bankname"].apply(lambda x: match(x, names, 0.4))

So that:

print(df)
# Output
   ID         Bankname
0   1  Bank of America
1   2  Bank of America
2   3   JPMorgan Chase
3   4   JPMorgan Chase

Of course, you can replace the inner ratio function with any other more appropriated sequence matcher.

Fuzzy matching column with right names of a list

1 Answers1