Matching values in multiple columns in R based on condition

Question

Say I have a datafame df

resident    faculty    submittedBy    match    caseID    phase

george      sally      george         1        george_1  pre
george      sally      sally          0        george_1  pre
george      sally      george         1        george_1  intra
jane        carl       jane           1        jane_1    pre
jane        carl       carl           0        jane_1    pre
jane        carl       carl           0        jane_1    intra

and I want to add a column df$response to this dataframe according to the following parameters (I'm thinking I need a set of nested ifelses, but I'm struggling to execute it correctly):

For a given row X, if df$match = 1,

print "1" in df$response if:

any row in df$match where df$match = 0 has the same contents in df$caseID, df$faculty, and df$phase as row X. Otherwise print "0".

So the output should be this:

response

1
0
0
1
0
0

because only the first and fourth rows contain values for which there are matches in df$caseID, df$faculty, and df$phase for both a row where df$match = 1 and a row where df$match = 0.

im not sure I follow...can you make the criteria into separate bullet points to distinguish between them better and just be a bit clearer in your wording? Maybe exlicity talk through why the first few responses are the way they are. — morgan121, Jul 09 '19 at 00:50
@soosus have you seen this [post](https://stackoverflow.com/questions/34642061/how-can-i-create-a-column-based-on-multiple-conditions)? — mnm, Jul 09 '19 at 00:51
how do row 1/4 match? row1: `faculty = sally`, `caseID = george_1` and `phase = pre`. none of these match? I cannpt follow your logic. What are they suppose to be matching? — morgan121, Jul 09 '19 at 01:05
@RAB i'm looking for a row X where match = 1 and then seeing if any row where match = 0 has the same values as row X for phase, faculty, and caseID (rows 1 and 2 fit this criterion because one of them has match = 1 and the other has match = 0 but phase, faculty, and caseID are the same) — soosus, Jul 09 '19 at 01:06

akrun · Answer 1 · 2019-07-09T02:03:10.777

We can use data.table methods. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'caseID', 'faculty', 'phase', get the length of unique elements of match check if it is equal to 2 and create a binary column ('response'), For values where 'match' is 0', assign the 'response' to 0

library(data.table)
setDT(df1)[, response := +((uniqueN(match) == 2) & match != 0), 
                  .(caseID, faculty, phase)][]
#   resident faculty submittedBy match   caseID phase response
#1:   george   sally      george     1 george_1   pre        1
#2:   george   sally       sally     0 george_1   pre        0
#3:   george   sally      george     1 george_1 intra        0
#4:     jane    carl        jane     1   jane_1   pre        1
#5:     jane    carl        carl     0   jane_1   pre        0
#6:     jane    carl        carl     0   jane_1 intra        0

Or using base R with ave

with(df1,+( match != 0 & ave(match, caseID, faculty, phase, 
         FUN = function(x) length(unique(x))) == 2))
#[1] 1 0 0 1 0 0

data

df1 <- structure(list(resident = structure(c(1L, 1L, 1L, 2L, 2L, 2L), 
.Label = c("george", 
"jane"), class = "factor"), faculty = structure(c(2L, 2L, 2L, 
1L, 1L, 1L), .Label = c("carl", "sally"), class = "factor"), 
    submittedBy = structure(c(2L, 4L, 2L, 3L, 1L, 1L), .Label = c("carl", 
    "george", "jane", "sally"), class = "factor"), match = c(1L, 
    0L, 1L, 1L, 0L, 0L), caseID = structure(c(1L, 1L, 1L, 2L, 
    2L, 2L), .Label = c("george_1", "jane_1"), class = "factor"), 
    phase = structure(c(2L, 2L, 1L, 2L, 2L, 1L), .Label = c("intra", 
    "pre"), class = "factor")), class = "data.frame", row.names = c(NA, 
-6L))

love the `ave` solution, very succinct. are you able to explain how exaclty it works? never seen it used that way — morgan121, Jul 09 '19 at 06:49
@RAB `ave` can be used as a replacement for group by + mutate from dplyr or `[, := , by =]` in data.table. The way it works is we pass the first argument as the column/vector to modify, here it is `match`, and the subsequent arguments are all grouping variable, until the `FUN`. If we dont pass any `FUN`, by default takes the `mean`. Good thing is that it won't change the original order. In this case, we are passing anonymous function call to check the `length` of `unique` elements of `match` in each of these groups. The `match != 0 &` is to replace values 0 in match to 0 — akrun, Jul 09 '19 at 12:49

score 2 · Answer 2 · edited Jul 09 '19 at 01:50

Indexing using [] is much faster, and less costly on your machine

df <- data.frame(
  "resident" = c("george","george","george","jane","jane","jane"),
  "faculty" = c("sally","sally","sally","carl","carl","carl"),
  "submittedBy" = c("george","sally","george","jane","carl","carl"),
  "match" = c(1,0,1,1,0,0),
  "caseID" = c("george_1","george_1","george_1","jane_1","jane_1","jane_1"),
  "phase" = c("pre","pre","intra","pre","pre","intra"),
  stringsAsFactors = FALSE
  )

response <- NULL

for (i in 1:nrow(df)) {
  response[i] <- ifelse(
    df$match[i] == 0, 0,
    ifelse(
      any(paste(df$caseID,df$faculty,df$phase,sep="")[df$match == 0] == 
            paste(df$caseID,df$faculty,df$phase,sep="")[i]),
      1, 0
    )
  )
}

response
[1] 1 0 0 1 0 0

I keep getting a "Error: unexpected '}' in:" error for this code — soosus, Jul 09 '19 at 01:21

score 2 · Accepted Answer · edited Jul 09 '19 at 01:50

Here is how I'd do it

# read the data
test <- read.table(text = 'resident    faculty    submittedBy    match    caseID    phase
                   george      sally      george         1        george_1  pre
                   george      sally      sally          0        george_1  pre
                   george      sally      george         1        george_1  intra
                   jane        carl       jane           1        jane_1    pre
                   jane        carl       carl           0        jane_1    pre
                   jane        carl       carl           0        jane_1    intra', header=T)

# create the response
resp <- logical(0)

# iterate over each loop
for (rr in 1:nrow(test)){
  if (test$match[rr] == 0){
    resp[rr] <- 0
  }
  else{
    tmp <- rbind(test[-rr, c('faculty', 'caseID', 'phase')],  # add the onto the end
                 test[rr, c('faculty', 'caseID', 'phase')])   # test if line is duplicated
    resp[rr] <- ifelse(duplicated(tmp)[nrow(tmp)], 1, 0)
  }
}

score 2 · Answer 4 · answered Jul 09 '19 at 01:35

Assuming you have only 1 and 0 values in match, one way with dplyr would be to check for every caseID, faculty and phase if there is two distinct values in match (1 and 0) and replace the response to 0 where match is 0.

library(dplyr)
df %>%
  group_by(caseID, faculty, phase) %>%
  mutate(response = as.integer(n_distinct(match) == 2),
         response = replace(response, match == 0, 0))

#  resident faculty submittedBy match caseID   phase response
#  <chr>    <chr>   <chr>       <dbl> <chr>    <chr>    <dbl>
#1 george   sally   george          1 george_1 pre          1
#2 george   sally   sally           0 george_1 pre          0
#3 george   sally   george          1 george_1 intra        0
#4 jane     carl    jane            1 jane_1   pre          1
#5 jane     carl    carl            0 jane_1   pre          0
#6 jane     carl    carl            0 jane_1   intra        0

score 2 · Answer 5 · answered Jul 09 '19 at 02:05

Another data.table approach. Join on the key variables and check if the values are not in the match==0 set:

library(data.table)
setDT(dat)

dat[, response := match==1]
dat[!dat[match==0], on=c("caseID","faculty","phase"), response := FALSE]

dat
#   resident faculty submittedBy match   caseID phase response
#1:   george   sally      george     1 george_1   pre     TRUE
#2:   george   sally       sally     0 george_1   pre    FALSE
#3:   george   sally      george     1 george_1 intra    FALSE
#4:     jane    carl        jane     1   jane_1   pre     TRUE
#5:     jane    carl        carl     0   jane_1   pre    FALSE
#6:     jane    carl        carl     0   jane_1 intra    FALSE

Matching values in multiple columns in R based on condition

5 Answers5

data