-1

I have a data.table and I'm trying to create a new column by checking to see if a row has particular values in any of a given set of columns.

head(d1)

   MEDREC_KEY   pat_key           drug1          drug2          drug3       drug4        drug5       drug6      drug7     drug8 drug9 drug10 drug11 drug12
1: -140665983 669723105 Anti-infectives Cephalosporins     Ethambutol   Isoniazid   Macrolides Penicillins Quinolones Rifamycin    NA     NA     NA     NA
2: -606290573  85924804 Anti-infectives   Beta-lactams Cephalosporins Penicillins   Quinolones          NA         NA        NA    NA     NA     NA     NA
3: -615873176 161009395  Cephalosporins    Penicillins             NA          NA           NA          NA         NA        NA    NA     NA     NA     NA
4: -616819481  36280536 Anti-infectives Cephalosporins     Macrolides  Quinolones           NA          NA         NA        NA    NA     NA     NA     NA
5: -625709819 720290063 Anti-infectives Cephalosporins     Ethambutol   Isoniazid Pyrazinamide  Quinolones  Rifamycin        NA    NA     NA     NA     NA
6: -637094857 720918635 Anti-infectives    Penicillins     Quinolones          NA           NA          NA         NA        NA    NA     NA     NA     NA

What I want to happen is if any of the "drug" columns == "Macrolides" AND any of the same columns == "Cephalosporins" then my new column, "correct" == 1 otherwise "correct" == 0 (or it could be logical), like so:

head(d1)
   MEDREC_KEY   pat_key           drug1          drug2          drug3       drug4        drug5       drug6      drug7     drug8 drug9 drug10 drug11 drug12 correct
1: -140665983 669723105 Anti-infectives Cephalosporins     Ethambutol   Isoniazid   Macrolides Penicillins Quinolones Rifamycin    NA     NA     NA     NA   1
2: -606290573  85924804 Anti-infectives   Beta-lactams Cephalosporins Penicillins   Quinolones          NA         NA        NA    NA     NA     NA     NA   0
3: -615873176 161009395  Cephalosporins    Penicillins             NA          NA           NA          NA         NA        NA    NA     NA     NA     NA   0
4: -616819481  36280536 Anti-infectives Cephalosporins     Macrolides  Quinolones           NA          NA         NA        NA    NA     NA     NA     NA   1
5: -625709819 720290063 Anti-infectives Cephalosporins     Ethambutol   Isoniazid Pyrazinamide  Quinolones  Rifamycin        NA    NA     NA     NA     NA   0
6: -637094857 720918635 Anti-infectives    Penicillins     Quinolones          NA           NA          NA         NA        NA    NA     NA     NA     NA   0

I've tried both of these (but am still learning how to decipher warning messages so those don't help much, especially as I am new to data.table):

> d1$correct<-ifelse(d1[,c(3:14)]=="Macrolides" | d1[,c(3:14)]=="Cephalosporins", 1, 0)
Warning messages:
1: In `[<-.data.table`(x, j = name, value = value) :
  12 column matrix RHS of := will be treated as one vector
2: In `[<-.data.table`(x, j = name, value = value) :
  Supplied 56868 items to be assigned to 4739 items of column 'correct' (52129 unused)
> 
> 
> selected_cols<-c("drug1", "drug2", "drug3", "drug4", "drug5", "drug6", "drug7", "drug8", "drug9", "drug10", "drug11", "drug12")
> d1$correct<-ifelse(d1 %in% selected_cols=="Macrolides" | d1 %in% selected_cols=="Cephalosporins", 1, 0)
Warning message:
In `[<-.data.table`(x, j = name, value = value) :
  Supplied 16 items to be assigned to 4739 items of column 'correct' (recycled leaving remainder of 3 items).

The closest I've gotten is this:

d1$correct<-apply(d1, 1, function(r) any(r %in% c("Macrolides", "Cephalosporins")))

Which will give TRUE if either of those is true across columns, but I can't figure out how to do it if both of those is true across columns. I'd prefer to not have to use a stunningly massive ifelse statement, since I have 12 columns and more combinations I'll need to make, and the NA's throw it off anyway.

I'd love a dplyr or data.table solution since those are so elegant, but at this point I'm desperate.

ericotta
  • 67
  • 7

3 Answers3

1

This should work:

d1$correct<-apply(d1, 1, function(r) { any(r =="Macrolides") & any(r == "Cephalosporins") })
Marcelo
  • 4,234
  • 1
  • 18
  • 18
  • Thank you! This works. I accepted Steven Beaupre's because it fills in all of the rows rather than leaving NAs where it isn't True. – ericotta Mar 29 '17 at 17:17
1

Here's an idea:

library(dplyr)
library(tidyr)

df %>%
  gather(key, value, -MEDREC_KEY, -pat_key) %>%
  group_by(MEDREC_KEY, pat_key) %>%
  mutate(correct = +all(c("Macrolides", "Cephalosporins") %in% value)) %>%
  spread(key, value)
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
  • 1
    Yep, that'll do it. I have never seen this +all before. Thanks so much! – ericotta Mar 29 '17 at 17:16
  • @ericotta From the `all` documentation: *Given a set of logical vectors, are all of the values true ?*. Here I use the `+` operator only to turn the `TRUE` and `FALSE` into `1` and `0` – Steven Beaupré Mar 29 '17 at 17:54
1

Okay I built an example and tried it out. Not a dplyr/tidyr method though.

d1 <- data.table::data.table(x_key = c(-101,-102,-103), y_key = c(669,668,667), 
                            drug1 = c("Macrolides",NA,"Macrolides"), 
                            drug2 = c(NA, "Cephalosporins", "Cephalosporins"))

   x_key y_key      drug1          drug2
1:  -101   669 Macrolides             NA
2:  -102   668         NA Cephalosporins
3:  -103   667 Macrolides Cephalosporins

d1$correct <- rowSums(apply(d1, 2, function(r) (r %in% c("Macrolides", "Cephalosporins")))[,-c(1:2)]*1)>=2
d1
   x_key y_key      drug1          drug2  correct
1:  -101   669 Macrolides             NA    FALSE
2:  -102   668         NA Cephalosporins    FALSE
3:  -103   667 Macrolides Cephalosporins     TRUE

The problem you were having is you were doing your apply with an index of 1, while you really wanted an index of 2. This checks if there are at least 2 TRUEs which will only work the way you want it to if you don't repeat the same drug (e.g., 2 Macrolides means 2 TRUES and thus, correct == TRUE).

Evan Friedland
  • 3,062
  • 1
  • 11
  • 25
  • The benefit here is you can create a character vector and place it into the function for quick searches: check <- c("Marcolides","Cephalosporins") rowSums(apply(d1, 2, function(r) (r %in% check))[,-c(1:2)]*1)>=2 One can do all kinds of drug combinations very quickly this way. – Evan Friedland Mar 30 '17 at 05:06
  • This is great, and definitely easy to modify for my other options! Thanks so much. – ericotta Mar 30 '17 at 13:23