2

I have two dfs: one with all patients (df1) and one with a subset of those patients who are getting additional treatment (df2).

I want to join the two dfs, keeping all rows in df1 and adding a new column for “0/1” if the patient is not getting the additional treatment (0) or is getting the additional treatment (1). There shouldn’t be any issues of patients in df2 who are not in df1, but if there are, I want to catch them!

It's important I keep all of the rows in df1 as I want to compare results between df1 and df2

Simplified example:

df1

id      date        
0001    2/2/22
0002    1/1/22  
0002    3/2/22  
0003    3/2/22  
0003    5/2/22  
0004    6/7/22  
0004    9/2/22  

df2

id          
0002        
0004 

Desired output:

id      date    tx
0001    2/2/22  0
0002    1/1/22  1
0002    3/2/22  1
0003    3/2/22  0
0003    5/2/22  0
0004    6/7/22  1
0004    9/2/22  1

I know that I start with a join based on ID, but not sure the most elegant way to create the new column based on absence/presence of data in df2?

df1 %>% 
left_join(df2, by = "id")

I found this other question which is trying to do something similar, but I don't see how they joined the two datasets.

Create a presence/absence column based on presence records

Jackie
  • 43
  • 4

1 Answers1

1

In this case, you can simply create a boolean vector in base R, no need to join, merge, etc:

df$tx <- +(df$id %in% df2$id)

#    id   date tx
#1 0001 2/2/22  0
#2 0002 1/1/22  1
#3 0002 3/2/22  1
#4 0003 3/2/22  0
#5 0003 5/2/22  0
#6 0004 6/7/22  1
#7 0004 9/2/22  1

The +(...) makes it a 1/0, just doing df$tx <- df$id %in% df2$id will return TRUE/FALSE

Data

df <- structure(list(id = c("0001", "0002", "0002", "0003", "0003", 
"0004", "0004"), date = c("2/2/22", "1/1/22", "3/2/22", "3/2/22", 
"5/2/22", "6/7/22", "9/2/22")), row.names = c(NA, -7L), class = "data.frame")

df2 <- structure(list(id = c("0002", "0004")), class = "data.frame", row.names = c(NA, 
-2L))
jpsmith
  • 11,023
  • 5
  • 15
  • 36