I have a dataframe of drug IDs (NDC_NBR) and their corresponding drug names (BRAND_NM).
I need to collapse/aggregate the drug names to the least specificity as possible per drug.
Here is an example of the data I am working with and the expected outcome:
data <- data.frame(NDC_NBR = c("00002773701","00002775201","00002775205","00002822201","00002822259","00002823301","00002823305","00074024302","00074006702","00074433902"),BRAND_NM = c("INSULIN LISPRO","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO JUNIOR KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO KWIKPEN","INSULIN LISPRO PROTAMINE/INSUL","INSULIN LISPRO PROTAMINE/INSUL","HUMIRA","HUMIRA PEDIATRIC CROHNS DISEAS","HUMIRA PEN"), RESULT = c("INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","INSULIN LISPRO","HUMIRA","HUMIRA","HUMIRA"))
I wish to collapse these products to the least specific drug names i.e. mutate a new column with the character "INSULIN LISPRO" and "HUMIRA".
"INSULIN LISPRO" is common to the first 7 rows yet "INSULIN LISPRO KWIKPEN" is only common to 2 of the 7. Similarly HUMIRA has no similarity to any of the "INSULIN" rows but is common to all of the last 3 rows.
I have a huge data frame of such products and it's not possible to manually convert each one.
I'd be very grateful if anyone could suggest a solution to such a problem.