Here is one simple way to do it using the difflib module from Python's standard library, which provides helpers for comparing sequences and computing deltas.
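from difflib import SequenceMatcher

# Define a helper function
def match(x, values, threshold):
    """Return the entry of values most similar to x, or x itself if none scores above threshold."""
    def ratio(a, b):
        # Similarity score in [0, 1]; 1.0 means the strings are identical
        return SequenceMatcher(None, a, b).ratio()
    # Keep only the candidates whose similarity to x exceeds the threshold
    results = {
        value: ratio(value, x) for value in values if ratio(value, x) > threshold
    }
    # Best-scoring candidate, or x unchanged when nothing qualifies
    return max(results, key=results.get) if results else x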
And then:
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Bankname": ["Bank of America", "bnk of America", "Jp Morg", "Jp Morgan"],
    }
)
names = ["Bank of America", "JPMorgan Chase"]
df["Bankname"] = df["Bankname"].apply(lambda x: match(x, names, 0.4))
So that:
print(df)
# Output
   ID         Bankname
0   1  Bank of America
1   2  Bank of America
2   3   JPMorgan Chase
3   4   JPMorgan Chase
Of course, you can replace the inner ratio function with any other, more appropriate sequence matcher.
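For instance, here is a rough sketch of how the scoring function could be made swappable. The names casefold_ratio and match_with are made up for this illustration (they are not part of difflib or pandas), and the case-insensitive scorer is only one possible choice, so treat it as a starting point rather than a definitive implementation:

```python
from difflib import SequenceMatcher


def casefold_ratio(a, b):
    # Hypothetical scorer: the same difflib ratio, but ignoring letter case
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_with(x, values, threshold, scorer=casefold_ratio):
    # Score every candidate once with the chosen scorer
    scores = {value: scorer(value, x) for value in values}
    # Keep only the candidates that score above the threshold
    kept = {value: score for value, score in scores.items() if score > threshold}
    # Best-scoring candidate, or x unchanged when nothing qualifies
    return max(kept, key=kept.get) if kept else x


names = ["Bank of America", "JPMorgan Chase"]
print(match_with("jp morgan", names, 0.4))
```

Any callable that takes two strings and returns a number can be passed as scorer, so a scorer from a third-party fuzzy-matching library could be plugged in the same way. This version also scores each candidate only once instead of twice.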