
I am working with two dataframes (df1 and df2), and I would like to merge df2 into df1 based on name matching. The names do not match exactly between the two (for example, 'JS Smith' may appear as 'J.S. Smith (Jr)'), and the names in df1 are stored as a "|"-delimited list of name variants.

Additionally, df2 has one other column containing slightly different names that I would like to fall back on if there are no matches from the original column.

Lastly, I would like to bring in data from df2 only if there is a single unique match in df1, and I would not like to overwrite any entries that have already been brought in.

Here is an example of what the dataframes look like:

df1 (where N1 represents the first name in the list of name variants)

    Name Variants
0   N1|N2|N3|N4
1   N1|N2
2   N1|N2|N3

df2

    Name Type 1        Name Type 2        Data1     Data2     Data3
0   Name 0             Name 0.1           X         Y         Z
1   Name 1             Name 1.1           A         B         C
2   Name 2             Name 2.1           D         E         F
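
For reference, the toy frames above can be constructed like this (a sketch; the Data values follow the rows shown):

```python
import pandas as pd

df1 = pd.DataFrame({"Name Variants": ["N1|N2|N3|N4", "N1|N2", "N1|N2|N3"]})

df2 = pd.DataFrame({
    "Name Type 1": ["Name 0", "Name 1", "Name 2"],
    "Name Type 2": ["Name 0.1", "Name 1.1", "Name 2.1"],
    "Data1": ["X", "A", "D"],
    "Data2": ["Y", "B", "E"],
    "Data3": ["Z", "C", "F"],
})
```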

I would like to do the matching first on "Name Type 2". Supposing the matches are:

  1. Name 0.1 -> one of the names in N1|N2 (row 1 of df1)
  2. Name 2.1 -> one of the names in N1|N2|N3|N4 (row 0 of df1)
  3. Name 1.1 -> Does not match any names in df1; I then fall back to Name 1, which matches N1|N2|N3 (row 2 of df1)

The resulting new df would look like:

    Name Variants    Matched Name    Data1     Data2     Data3    Matched
0   N1|N2|N3|N4      Name 2.1        D         E         F        True
1   N1|N2            Name 0.1        X         Y         Z        True
2   N1|N2|N3         Name 1          A         B         C        True

My current approach has been:

  1. Loop through each row in df2 and search df1 using df1[df1['Name Variants'].str.contains('Name 0.1')]
  2. If there is a unique match (1 row found in df1) and "Matched" is not marked as "True" then I pull in the data
  3. If there are multiple matches I do not pull in the data
  4. If there are no matches, I search for "Name 0" with the same methodology and again run through the same logic (1 match, no data currently merged, etc.)
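
The four steps above can be sketched as follows (a minimal illustration, not my real data: the frames and names here are made up so that the matches actually fire, since the N1/N2 placeholders would not match anything):

```python
import pandas as pd

# illustrative stand-ins for df1 and df2
df1 = pd.DataFrame({"Name Variants": ["Alice|Ally", "Bob|Bobby", "Carol|Caz"]})
df2 = pd.DataFrame({
    "Name Type 1": ["Alicia", "Bob", "Carl"],
    "Name Type 2": ["Ally", "Robert", "Caz"],
    "Data1": ["X", "A", "D"],
})

df1["Matched"] = False

def unique_match(name):
    """Index of the single df1 row whose variant list contains `name`, else None."""
    hits = df1.index[df1["Name Variants"].str.contains(name, regex=False)]
    return hits[0] if len(hits) == 1 else None

for _, row in df2.iterrows():
    # step 1: search on "Name Type 2" first
    idx = unique_match(row["Name Type 2"])
    matched_name = row["Name Type 2"]
    if idx is None:
        # step 4: no match, fall back to "Name Type 1"
        idx = unique_match(row["Name Type 1"])
        matched_name = row["Name Type 1"]
    # steps 2-3: pull data in only on a unique match that is not already filled
    if idx is not None and not df1.loc[idx, "Matched"]:
        df1.loc[idx, "Matched Name"] = matched_name
        df1.loc[idx, "Data1"] = row["Data1"]
        df1.loc[idx, "Matched"] = True
```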

My issues with this are:

  1. It is exceptionally time intensive, given the row-by-row looping and repeated substring searches over the whole dataframe
  2. I am not matching as well as I could given the slight spelling differences as initially described
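
On the speed point: most of the cost comes from re-scanning every row with str.contains for each product. If exact equality against the individual variant tokens is acceptable (an assumption; it will not help with the spelling differences), a mapping from variant name to row index can be built once with explode, so each lookup becomes a dict hit; names that appear in more than one row are dropped, which matches the unique-match rule:

```python
import pandas as pd

# toy db with "|"-delimited variant lists (made-up names)
db = pd.DataFrame({"alt_drug_names": ["aspirin|asa", "ibuprofen|advil"]})

# one row per individual variant, keeping the original row index
variant_rows = (
    db["alt_drug_names"].str.split("|").explode().reset_index()
    .groupby("alt_drug_names")["index"].unique()
)
# keep only names that identify exactly one row (the unique-match rule)
lookup = {name: rows[0] for name, rows in variant_rows.items() if len(rows) == 1}
```

Each product is then resolved with lookup.get(product) instead of a full scan of db.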

Here is the code to my current approach:

global_brands = set(ep["Global Brand"].dropna().str.replace("&", "").str.lower())
products = set(ep["Product"].dropna().str.replace("&", "").str.lower())
gx_name = set(ep["Generic Name"].dropna().str.replace(";","").str.lower())
#%%

print(len(global_brands))
print(len(products))
print(len(gx_name))
#%%
"""
add transformed names to ep and db

"""

ep["alt_global_brands"] = ep["Global Brand"].fillna("").str.replace("&", "").str.lower()
ep["alt_product"] = ep["Product"].fillna("").str.replace("&", "").str.lower()
ep["alt_gx_name"] = ep["Generic Name"].fillna("").str.replace(";","").str.lower()


db["alt_drug_names"] = db["Trans Drug Name"].str.lower()

#%%
print(db.loc[1805,"alt_drug_names"].split("|")[0] == "buprenorphine  naloxone")
#%%
print(ep.loc[166,"alt_product"] == "vx-661  ivacaftor")

#%%

ep['Match in db'] = ""
db['EP match'] = ""

num_product_nonmatches = 0
num_product_exact_matches = 0
double_matches = 0
for product in products:
    # .ix is deprecated (use .loc); regex=False guards against regex
    # metacharacters such as "(" in product names
    product_matches = len(db.loc[db["alt_drug_names"].str.contains(product, regex=False)])

    if product_matches == 1:
        matched_row = db.loc[db["alt_drug_names"].str.contains(product, regex=False)].index[0]

    if product_matches > 1:
        num_matched_rows = 0
        # .iteritems() was removed in pandas 2.0; use .items() instead
        for row, value in db.loc[db["alt_drug_names"].str.contains(product, regex=False), "alt_drug_names"].items():
            names = value.split("|")
            for name in names:
                if product == name:
                    matched_row = row
                    num_matched_rows += 1

        if num_matched_rows == 1:
            product_matches = 1


        #elif num_matched_rows > 1: - At no point was there still a double match after looping through each rows name variants and looking for an exact match
        if num_matched_rows == 0:
            """
            Here after looping through the name variants there was no exact match
            This seems to be for assets that are too generic (ex: clonidine hydrochloride, rotavirus vaccine, etc.)

            Approach: 
            1. Check if name has / to split and create combo
            2. If no / or still no match => leverage generic name
            """
            product_copy = product
            if "(" in product:
                product = product.split("(")[0].strip()

            if "/" in product:
                # join the "/"-separated fragments with the double space used
                # in alt_drug_names (e.g. "buprenorphine  naloxone")
                product = "  ".join(
                    fragment.strip() for fragment in product.split("/")
                )

            if len(db.loc[db["alt_drug_names"].str.contains(product, regex=False)]) == 1: # this instance does not occur
                product_matches = 1
                matched_row = db.loc[db["alt_drug_names"].str.contains(product, regex=False)].index[0]

            elif len(db.loc[db["alt_drug_names"].str.contains(product, regex=False)]) > 1:
                num_matched_rows = 0
                for row, value in db.loc[db["alt_drug_names"].str.contains(product, regex=False), "alt_drug_names"].items():
                    names = value.split("|")
                    for name in names:
                        if product == name:
                            matched_row = row
                            num_matched_rows += 1


                if num_matched_rows == 1:
                    product_matches = 1

            product = product_copy

    if product_matches == 0:
        num_product_nonmatches += 1
        """
        Check if name has / to split and create combo

        LEVERAGE GENERIC NAME

        """

        #product_name = ep[ep["Global Brand"].str.replace("&", "+")]
        #product_matches = len(db.ix[db["Drug Name"].str.contains(global_brand) and db.ix[db["Drug Name"].str.contains(global_brand)])
    if product_matches == 1:
        num_product_exact_matches += 1
#        print(product)
#        print(matched_row)
        #print(product)
        ep_row = ep[ep['alt_product'] == product].index[0]
        if ep.loc[ep_row,'Match in db'] == "":
            ep.loc[ep_row,'Match in db'] = "TRUE"
        if db.loc[matched_row,'EP match'] == "":
            db.loc[matched_row, 'EP match'] = "TRUE"
            db.loc[matched_row, 'EP Global Name'] = ep.loc[ep_row, 'Global Brand']
            db.loc[matched_row, 'EP Product'] = ep.loc[ep_row, 'Product']
            db.loc[matched_row, 'EP Generic Name'] = ep.loc[ep_row, 'Generic Name']
            db.loc[matched_row, 'EP Company'] = ep.loc[ep_row, 'Company']
            db.loc[matched_row, 'EP Rx or OTC'] = ep.loc[ep_row, 'Prescription']
            db.loc[matched_row, 'EP markets'] = ep.loc[ep_row, 'Markets']

            columns = ['2015 Actual/ Est. (Sales)','WW sales - 2008','WW sales - 2009','WW sales - 2010','WW sales - 2011','WW sales - 2012','WW sales - 2013','WW sales - 2014','WW sales - 2015',
                       'WW sales - 2016','WW sales - 2017','WW sales - 2018','WW sales - 2019','WW sales - 2020','WW sales - 2021','WW sales - 2022','WW sales - 2023','WW sales - 2024','WW sales - 2025',
                       'WW CAGR (2018 or Launch - 2025)','WW Est. Launch','U.S. sales - 2008','U.S. sales - 2009','U.S. sales - 2010','U.S. sales - 2011','U.S. sales - 2012','U.S. sales - 2013',
                       'U.S. sales - 2014','U.S. sales - 2015','U.S. sales - 2016','U.S. sales - 2017','U.S. sales - 2018','U.S. sales - 2019','U.S. sales - 2020','U.S. sales - 2021','U.S. sales - 2022',
                       'U.S. sales - 2023','U.S. sales - 2024','U.S. sales - 2025','U.S. CAGR (2018 or Launch - 2025)','Forecasters','Forecast Statistics']

            for col in columns:
                db.loc[matched_row, col] = ep.loc[ep_row, col]

            db.loc[matched_row, 'U.S. Est. Launch'] = ep.loc[ep_row,'U.S. Est. Lauch']



#%%

print("EP non matches: " + str(num_product_nonmatches))
print("EP matches: " + str(num_product_exact_matches))
print("EP total: " + str(num_product_nonmatches + num_product_exact_matches))
print("EP total products: " + str(len(ep)))
print("EP length of product set: " + str(len(products)))
print("EP double_matches: " + str(double_matches))
  • I think you really need to take the time to learn about [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf) or 'normalizing' your data as it would make a lot of these analyses much easier and without long looping. The tidy data paper by hadley wickham is an excellent start. – Ted Petrou Jan 12 '17 at 16:16
  • Thank you for the resource, will print and read today! I'm sure it is quite evident but I don't have a programming background and have only recently picked python up, so I thank you for the support and guidance! – wingsoficarus116 Jan 12 '17 at 16:22
