1

I have dataframe column with typos.

ID Banknane
1 Bank of America
2 bnk of America
3 Jp Morg
4 Jp Morgan

And I have a list with the right names of the banks.

["Bank of America", "JPMorgan Chase]

I want to check and replace wrong banknames with the right names of the list with the help of levenshtein distance.

1 Answers1

1

Here is one simple way to do it using Python standard library difflib module, which provides helpers for computing deltas.

from difflib import SequenceMatcher

# Define a helper function
def match(x, values, threshold):
    def ratio(a, b):
        return SequenceMatcher(None, a, b).ratio()

    results = {
        value: ratio(value, x) for value in values if ratio(value, x) > threshold
    }
    return max(results, key=results.get) if results else x

And then:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Bankname": ["Bank of America", "bnk of America", "Jp Morg", "Jp Morgan"],
    }
)

names = ["Bank of America", "JPMorgan Chase"]

df["Bankname"] = df["Bankname"].apply(lambda x: match(x, names, 0.4))

So that:

print(df)
# Output
   ID         Bankname
0   1  Bank of America
1   2  Bank of America
2   3   JPMorgan Chase
3   4   JPMorgan Chase

Of course, you can replace the inner ratio function with any other more appropriated sequence matcher.

Laurent
  • 12,287
  • 7
  • 21
  • 37