I have a list of company names that are not written consistently. The data set looks like this:

df["Name"] = ["Google", "google", "Google.inc", "Google Inc.", "Google.com"]

I have about 500,000 rows, and the names should be corrected in the best way possible.

My code looks like below:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

get_match = []

for row in df.index:
    name1= df.get_value(row,"Name")
    for columns in df2.index:
        name2=df2.get_value(columns,"Name")
        matched_token=[process.extract(x, name2, limit=3) for x in name1]
        get_match.append([matched_token, name1, name2])
df_maneet = pd.DataFrame({'Ratio': [i[0] for i in get_match], 'name1': [i[1] for i in get_match], 'name2':[i[2] for i in get_match]})

My result in matched_token is

[[('google', 100, 0), ('Sxyzdgg.', 48, 9), ('ggigsk', 45, 2)]]

but I want to append the result to df and see it like the output below.

[Image of the desired output; not preserved in this copy]

I think I am doing something wrong in the matched_token line, but I can't figure out what.

Thanks in advance

Maneet Giri

1 Answer

Maybe this code will help you:

import pandas as pd
from fuzzywuzzy import process

# Small example frames; replace these with your own data
df = pd.DataFrame({"Name": ["Google", "google.inc"]})
df2 = pd.DataFrame({"Name": ["google", "google"]})

get_match = []
for row in df.index:
    # DataFrame.get_value was removed in pandas 1.0; .at is the scalar accessor
    name1 = df.at[row, "Name"]
    for other in df2.index:
        name2 = df2.at[other, "Name"]
        # process.extract expects a collection of choices, so wrap name2 in a list.
        # It returns (choice, score) tuples; [0][1] is the score of the best match.
        ratio = process.extract(name1, [name2], limit=3)[0][1]
        get_match.append([name1, name2, ratio])

# One row per (name1, name2) pair
df_maneet = pd.DataFrame(get_match, columns=["name1", "name2", "Ratio"])

Final dataframe:

        name1   name2  Ratio
0      Google  google    100
1      Google  google    100
2  google.inc  google     90
3  google.inc  google     90

Arthur G.
  • Thanks for answering my question, but I honestly do not know how I would apply this "df = pd.DataFrame({"Name" : ["Google","google.inc"]})", as the Name column has more than 200,000 values. – Maneet Giri Nov 05 '18 at 16:33
  • It was just an example. You can have as many values as you want; it doesn't matter, the solution is the same. I can't see your dataframe, so I had to create a simple example ;) – Arthur G. Nov 05 '18 at 17:52
  • Thanks, it worked. The only problem is that "limit=3" has no effect and the code still returns all possible matches... – Maneet Giri Nov 05 '18 at 18:51
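
On the last comment: in the loop above, every call scores name1 against a list holding a single name, so limit=3 has nothing to trim and every pair ends up in the result. A minimal sketch of the alternative, scoring each name against the whole candidate list and keeping only the best few, might look like the following. It uses the standard library's difflib.SequenceMatcher as a stand-in scorer so that it runs without extra packages; this is an assumption for illustration, not fuzzywuzzy's scorer, so the scores will differ. top_matches and the sample names are hypothetical. With fuzzywuzzy, the equivalent call shape would be process.extract(name, choices, limit=3) with choices being the full list of names.

```python
from difflib import SequenceMatcher


def top_matches(name, choices, limit=3):
    """Return up to `limit` (choice, score) pairs, best first; score is 0-100.

    Hypothetical helper: SequenceMatcher stands in for a fuzzywuzzy scorer.
    """
    scored = [
        (choice, round(100 * SequenceMatcher(None, name.lower(), choice.lower()).ratio()))
        for choice in choices
    ]
    # sorted() is stable, so equally scored choices keep their original order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Made-up sample data for illustration
candidates = ["google", "Sxyzdgg.", "ggigsk", "Microsoft", "Goggle Inc"]
names = ["Google", "Google Inc."]

# One row per (name, candidate) pair, at most `limit` rows per name
rows = [
    (name, choice, score)
    for name in names
    for choice, score in top_matches(name, candidates, limit=3)
]
```

The rows list can then be turned into a frame with pd.DataFrame(rows, columns=["name1", "name2", "Ratio"]). Because the limit is applied once per name against all candidates, it now caps the number of matches kept for each name.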