1

I'm working with a large dataset of drugs and reactions using R. For now, I have the data structured as a very tall data frame that lists the report ID number, the Drug name, and the reported reactions. As you can tell, there is a one-to-many relationship between both IDs vs. drugs and drugs vs. reactions.

Keeping in mind that this dataset is MUCH larger than what I can duplicate here, I'd like to know how to find what pairs of drugs lead to what reactions and in what frequency.

Most importantly, I am interested in how to approach a problem like this. Is the data structured correctly? What concepts or libraries should I read about?

Here's a link to some real data: https://www.dropbox.com/s/kzx4mpyytbo9zil/query_result.csv

   ID    DRUG                                REACTION
1  1827  ASPIRIN                           CHEST PAIN
2  1827  CLARINEX                          CHEST PAIN
3  1827  ASPIRIN                                COUGH
4  1827  CLARINEX                               COUGH
5  1827  ASPIRIN                HAEMOGLOBIN DECREASED
6  1827  CLARINEX               HAEMOGLOBIN DECREASED
7  1827  ASPIRIN           NEUTROPHIL COUNT INCREASED
8  1827  CLARINEX          NEUTROPHIL COUNT INCREASED
9  1827  ASPIRIN               PHARYNGOLARYNGEAL PAIN
10 1827  CLARINEX              PHARYNGOLARYNGEAL PAIN
...

In my teeny little brain, the end result looks something like this...

    Drug1       Drug2       Reaction            Frequency
1   tylenol     alcohol     hepatic failure     298
2   advil       aleve       bleeding            201 
3   aspirin     advil       renal failure       199
4   docusate    senna       diarrhea            146
5   senna       sudafed     palpitations        121
6   xanax       alcohol     sedation            111
7   clarinex    benadryl    dry mouth           96
...
569 ASPIRIN     CLARINEX    CHEST PAIN          2

Drug1 and Drug2 are the drug pairs with the highest frequency from the entire dataset. A "drug pair" is defined as any combination of two drugs with the same report ID. The example output above would be interpreted as, "row 1 had 298 unique report IDs for which hepatic failure was the reaction."

Ryan
  • 650
  • 6
  • 14
  • possible duplicate of [Counting unique pairs of categorical variables in R](http://stackoverflow.com/questions/24925306/counting-unique-pairs-of-categorical-variables-in-r) –  Aug 10 '14 at 14:32
  • How are you choosing drug1 and drug2? You should show how you want ASPIRIN and CLARINEX to be in the final table. – rnso Aug 10 '14 at 14:36
  • @mso, question updated to answer your questions. Thank you. – Ryan Aug 10 '14 at 15:01
  • This is not quite the same question as is being asked in the proposed duplicate, [Counting unique pairs of categorical variables in R](http://stackoverflow.com/questions/24925306/counting-unique-pairs-of-categorical-variables-in-r). I do not simply want to count the number of unique pairs of drugs, I want to count the number of drug pairs + reactions. First, though I need to generate all the pairs of drugs for a given report. – Ryan Aug 10 '14 at 15:08
  • It is still not clear. "row 1 had 298 unique report IDs for which hepatic failure was the reaction." for both tylenol and alcohol? Please describe in detail. – rnso Aug 10 '14 at 15:24
  • 1
    I still don't get the point - looking at your reputation, I assume you are not looking for help on functions such as `unique`, how to subset data (`subset`, `[`), and how to loop through/apply this on `unique(df$REACTION)`? Unless this is a rather **statistical/conceptual** question, can you clarify the **programming** question? – Martin Aug 10 '14 at 15:33
  • 1
    Each report *id* can have many *drugs* and many *reactions*. I want to take every possible pair (*drug1* + *drug2*) of *drugs* for each report *id* and then every *reaction* for each report *id* and increment a counter for any instance of these three (*frequency*). (keeping in mind that aspirin + clarinex is the same as clarinex + aspirin) – Ryan Aug 10 '14 at 15:36
  • @Martin, My reputation is not based on my knowledge of R, that's for sure. I'm still trying to wrap my head around what appears to be the "R mindset." The conceptual question is "should I be using a different data type or organizing the data differently" and the programming question is exactly the type of thing you mention. I'll try to sink my teeth into `unique` and `subset` a bit more. Thank you. – Ryan Aug 10 '14 at 15:43
  • each report ID: `unique(df$ID)` -> loop through or use an apply function – Martin Aug 10 '14 at 15:54
  • It seems your data is a full join between two datas: what drugs a patient took and what reactions they had. If that's the case, maybe you'll agree that's a bit of a waste. You could just store it into two named lists. It would also be easier to work with in my opinion. – flodel Aug 10 '14 at 16:23
  • Are these data to be obtained from "ISR","DRUGNAME","OUTC_CODE","PT" of the csv file? "OUTC_CODE" is "CA" throughout the file. Is "ISR" of csv file same as ID and "PT" same as 'reaction' here? – rnso Aug 10 '14 at 16:24
  • @rnso Yes. ISR = ID, DRUGNAME = DRUG, PT = REACTION. OUTC_CODE is not necessarily relevant to the question. It is a filter I was using to get info from the larger dataset. – Ryan Aug 10 '14 at 17:15

1 Answers1

2

Ok, I try an answer - I hope I got the question correctly. The code is rather intended to give some ideas than to be elegant/final.
Please note: I intentionally used for loops instead of possible vectorisation / apply functions, to make it easier to understand (those who are familiar with apply functions will also undertand the for loop ;-)).
Please note 2: Since I don't have more than a tiny piece of data, I could not test the code for the whole dataset!
EDIT: columns based on example above - possibly different from csv data.

Key points are:

  • unique, [ etc.
  • utils::combn to get combinations
  • sum(FALSE/TRUE values) to count

Hope that helps!

require(utils)

df <- read.table(header=TRUE, 
text="LINE ID DRUG REACTION
1 1827 ASPIRIN CHEST_PAIN
2 1827 CLARINEX CHEST_PAIN
3 1827 ASPIRIN COUGH
4 1827 CLARINEX COUGH
5 1827 ASPIRIN HAEMOGLOBIN_DECREASED
6 1827 CLARINEX HAEMOGLOBIN_DECREASED
7 1827 ASPIRIN NEUTROPHIL_COUNT_INCREASED
8 1827 CLARINEX NEUTROPHIL_COUNT_INCREASED
9 1827 ASPIRIN PHARYNGOLARYNGEAL_PAIN
10 1827 CLARINEX PHARYNGOLARYNGEAL_PAIN")

# temporary object to collect if a combination is present
Results <- data.frame(Drug1=NA, Drug2=NA, Reaction=NA, Reaction.occurs=NA)
n=1 # start first line in Results object

#  walk through each ID ... 
for (ID in unique(df$ID)) { 

  # ... and each possible pair of drugs within a (report) ID ...
  drug.pairs <- utils::combn(x=unique(df[df$ID == ID, "DRUG"]), m=2) # the columns 
  for (ii in 1:ncol(drug.pairs)) {

    # ... and each reaction ...
    for (reaction in unique(df$REACTION)) {
      Results[n, "Drug1"] <- drug.pairs[1,ii]
      Results[n, "Drug2"] <- drug.pairs[2,ii]
      Results[n, "Reaction"] <- reaction
      Results[n, "Reaction.occurs"] <- drug.pairs[1,ii] %in% df[df$REACTION == reaction & df$ID == ID, "DRUG"] &
        drug.pairs[2,ii] %in% df[df$REACTION == reaction & df$ID == ID, "DRUG"]
      n <- n+1
    }
  }
}

head(Results)

# then find the unique Drug1 - Drug2 -Reaction combinations, and count the TRUE values:
(Results[!duplicated(Results[,1:3]), ][,1:3])
(unique(Results[, 1:3]))

# Results2 contains only the unique combinations
Results2 <- Results[!duplicated(Results[,1:3]), ][,1:3]

# calculatethe frequencies
for (i in 1:nrow(Results2)) {
  Results2[i, "Frequency"] <- sum(Results[Results$Drug1 == Results2[i, "Drug1"] & 
                                            Results$Drug2 == Results2[i, "Drug2"] & 
                                            Results$Reaction == Results2[i, "Reaction"], ]$Reaction.occurs)
}

Results2
# --- end ----

gives:

    Drug1    Drug2                   Reaction Frequency
1 ASPIRIN CLARINEX                 CHEST_PAIN         1
2 ASPIRIN CLARINEX                      COUGH         1
3 ASPIRIN CLARINEX      HAEMOGLOBIN_DECREASED         1
4 ASPIRIN CLARINEX NEUTROPHIL_COUNT_INCREASED         1
5 ASPIRIN CLARINEX     PHARYNGOLARYNGEAL_PAIN         1
Martin
  • 594
  • 5
  • 16
  • I tested above code with data from csv file: Only 3 items have frequency of 1, all other frequencies are 0: LAMIVUDINE+STAVUDINE:"DRUG EXPOSURE DURING PREGNANCY 1" LAMIVUDINE+STAVUDINE:"HIP DYSPLASIA 1" LAMIVUDINE+STAVUDINE:"MATERNAL DRUGS AFFECTING FOETUS 1" (I have added +,: and "" for clarification). – rnso Aug 10 '14 at 17:11
  • No *exactly* what I'll end up using, but the techniques and functions gave me a great starting point to be able to learn how to frame this problem. The functions are dead on. Thanks for teaching! – Ryan Aug 12 '14 at 09:32