
I have a list of company names, but they contain misspellings and variations. How can I best fix this so that every company follows a consistent naming convention (for later groupby, sort_values, etc.)?

import pandas as pd
df = pd.DataFrame({'Company': ['Disney', 'Dinsey', 'Walt Disney', 'General Motors', 'General Motor', 'GM', 'GE', 'General Electric', 'J.P. Morgan', 'JP Morgan']})
denpy

1 Answer


One good option is the FuzzyWuzzy library, which describes itself this way: "Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package." Example:

from fuzzywuzzy import process

str2Match = "apple inc"
strOptions = ["Apple Inc.", "apple park", "apple incorporated"]

# Score every option against the query; results come back sorted best-first
Ratios = process.extract(str2Match, strOptions)
print(Ratios)

# You can also select the string with the highest matching percentage
highest = process.extractOne(str2Match, strOptions)
print(highest)

Output:

[('Apple Inc.', 100), ('apple incorporated', 90), ('apple park', 67)]
('Apple Inc.', 100)

Now you just need to create a list of the "right names" and run every variation against it, so you can pick the match with the best ratio and replace the original value in your dataset.
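As a rough sketch of that last step, here is one way the replacement could look for the company names in the question. To keep it runnable without extra installs, it swaps in the standard library's difflib.SequenceMatcher for FuzzyWuzzy, so scores run from 0.0 to 1.0 rather than 0 to 100. The names CANONICAL, ALIASES, normalize_company and the 0.6 cutoff are all assumptions made for illustration; you would curate the lists and tune the cutoff for your own data. Abbreviations such as "GM" share too few characters with the full name for fuzzy matching to work, so they are mapped by hand.

```python
from difflib import SequenceMatcher

# Hypothetical canonical spellings; curate this list for your own data.
CANONICAL = ["Disney", "General Motors", "General Electric", "JP Morgan"]

# Abbreviations are too short to match fuzzily, so map them explicitly
# (keys are lower-case).
ALIASES = {"gm": "General Motors", "ge": "General Electric"}


def normalize_company(name, canonical=CANONICAL, aliases=ALIASES, cutoff=0.6):
    """Return the closest canonical name, or the input unchanged if none is close enough."""
    key = name.strip().lower()
    if key in aliases:
        return aliases[key]
    # Similarity of the cleaned name to each canonical name, from 0.0 to 1.0.
    scored = [(SequenceMatcher(None, key, c.lower()).ratio(), c) for c in canonical]
    best_score, best_name = max(scored)
    return best_name if best_score >= cutoff else name


raw = ['Disney', 'Dinsey', 'Walt Disney', 'General Motors', 'General Motor',
       'GM', 'GE', 'General Electric', 'J.P. Morgan', 'JP Morgan']
cleaned = [normalize_company(n) for n in raw]
print(cleaned)
```

On a DataFrame, the same function can be applied with df['Company'] = df['Company'].map(normalize_company). If you prefer to stay with FuzzyWuzzy, process.extractOne(name, CANONICAL, score_cutoff=...) can take the place of the SequenceMatcher scoring; it returns None when nothing reaches the cutoff. Either way, it is worth reviewing the low-scoring matches by hand before overwriting the column.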