Matching columns in dataframe using regex

Question

Looking to perform a regex function to match a column of a dataframe with the first word of another. The dataframes were collected from different sources so the names of the drug are similar but do not match completely. They do match up if you ignore case and match for the first word.

I have two dataframes: one with drug names and another with a list of drug names with their respective prices. Fruits were added to the drug names for example purposes.

Dataframe A  
      drug
0   drug1 apple
1   drug2 orange
2   drug3 lemon
3   drug4 peach

Dataframe B
   drugB    price   Regex
0  DRUG2     2  ^([\w\-]+)
1  DRUG4     4  ^([\w\-]+)
2  DRUG3     3  ^([\w\-]+)
3  DRUG1     1  ^([\w\-]+)

I am looking to use the Regex column to append dataframe A to B like so. Hopefully using the first name of drug column and match it to the respective column.

       drug         drugB   price   Regex
0   drug1 apple     DRUG1     1  ^([\w\-]+)
1   drug2 orange    DRUG2     2  ^([\w\-]+)
2   drug3 lemon     DRUG3     3  ^([\w\-]+)
3   drug4 peach     DRUG4     4  ^([\w\-]+)

I was inspired to try it this way based on the following stackoverflow question: How to merge pandas table by regex.

Thank you in advance! I hit a dead end with this problem and couldn't figure a way to get it to work.

Given your description, it seems like you may want `.str.split().str[0]` for DataFrameA and `.str.lower()` for DataFrame B, then you can exact match. — ALollz, Aug 12 '19 at 16:37

jottbe · Answer 1 · 2019-08-12T18:14:57.727

You don't really need to define the regexes in the second dataframe. ALollz is right btw. you could easily split the string, but I guess the purpose you need this for is more complex and probably you have drug names which include spaces.

Simple version with a common regex

If you can manage to define one common regex that matches all drug names, you can use the following code:

df_A['drugA']= df_A['drug'].str.extract('^\s*(?P<drugA>[\w\-]*)')['drugA'].str.upper()
df_A.merge(df_B[['drugB', 'price']], left_on='drugA', right_on='drugB', how='left')

Just replace the expression behind with the regex you need. The output would be:

           drug  drugA  drugB  price
0   drug1 apple  DRUG1  DRUG1      1
1  drug2 orange  DRUG2  DRUG2      2
2   drug3 lemon  DRUG3  DRUG3      3
3   drug4 peach  DRUG4  DRUG4      4

Version with a generated regex

drug_list= df_B['drugB'].to_list()
# sort the drug names by length descending
# to make sure we get the longest match
# --> relevant only if a drug name is included
#     fully in another name
#     Like "Aspirin" & "Aspirin plus C"
drug_list.sort(key=lambda drug: len(drug), reverse=True)

drug_pattern= '^\s*(?P<drugA>{drug_list})'.format(drug_list='|'.join(drug_list))

df_A['drugA']= df_A['drug'].str.extract(drug_pattern, re.I)['drugA'].str.upper()

df_A.merge(df_B[['drugB', 'price']], left_on='drugA', right_on='drugB', how='left')

This outputs the same as above. Please note, that this version might be limited regarding the number of drugs you can use. If you have hundrets of drugs, it might run into problems, because the regular expression string gets long in that case. But this version is sharper and also supports space in the drug names. In case you can work out one pattern, that is able to cut out all drug names correctly, I definatley would recommend to use the first method. E.g. if you can spot a pattern, that comes after the drug name, you can use it to cut out the drug names much easier.

Thanks! you were right about the drug names that I have being more difficult than the example I provided. Your regex example worked extremely well! but there were too many variance in the drug name formatting that I had to opt for the Alollz's answer and manually change the rows that were too difficult to create a regex for. — Hank Yun, Aug 13 '19 at 04:15

Matching columns in dataframe using regex

1 Answers1

Simple version with a common regex

Version with a generated regex