R: Subset from two data frames based on multiple conditions

Question

I have two dataframes (df1 and df2), and I want a new dataframe (df3) containing all rows where "date" AND "time_of_day" of df1 match with df2. And save the rows of df1 that don't match as well in a new dataframe (df4).

I tried using dplyr filter function, but it seems like I am not writing it correctly, as I am getting a new dataframe of the same length as df1 but it should show me only the matching rows based on both variables date and time of day.

> df1
          date time_of_day     
1  2018-06-03     morning 
2  2018-06-06     afternoon 
4  2018-06-09     morning 
5  2018-06-10     afternoon 

> df2
          date time_of_day     
1  2018-06-03     morning 
2  2018-06-06     morning 
3  2018-06-08     morning 
4  2018-06-09     morning 
5  2018-06-10     afternoon
6  2018-06-11     afternoon

#creating a new data frame
df3 <- filter(df1, date %in% df2$date & time_of_day %in% df2$time_of_day)
#another try 
df3 <- df1[df1$date %in% df2$date & df1$time_of_day %in% df2$time_of_day,]

This is what I want:

> df3
          date time_of_day     
1  2018-06-03     morning 
2  2018-06-09     morning 
3  2018-06-10     afternoon 

> df4
          date time_of_day     
1  2018-06-06     afternoon

Try `inner_join(df1, df2)` and `anti_join(df1, df2)` – akrun Jun 06 '19 at 15:39 — akrun, Jun 06 '19 at 15:39

score 3 · Answer 1 · answered Jun 06 '19 at 15:40

3

We can do this with inner_join

library(dplyr)
df3 <- inner_join(df1, df2)
df3
#       date time_of_day
#1 2018-06-03     morning
#2 2018-06-09     morning
#3 2018-06-10   afternoon

and anti_join

df4 <- anti_join(df1, df2)
df4
#       date time_of_day
#1 2018-06-06   afternoon

data

df1 <- structure(list(date = c("2018-06-03", "2018-06-06", "2018-06-09", 
"2018-06-10"), time_of_day = c("morning", "afternoon", "morning", 
"afternoon")), class = "data.frame", row.names = c("1", "2", 
"4", "5"))

df2 <- structure(list(date = c("2018-06-03", "2018-06-06", "2018-06-08", 
"2018-06-09", "2018-06-10", "2018-06-11"), time_of_day = c("morning", 
"morning", "morning", "morning", "afternoon", "afternoon")),
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

answered Jun 06 '19 at 15:40

akrun

874,273
37
540
662

thank you so much, I think that is going to help, the only problem I have now, is that in my real df2 are more columns than in df1, so I don't achieve the result that I need. Do you know how I can solve that? – chiarasahara Jun 06 '19 at 15:53
@chiarasahara Just subset the columns of 'df2' i.e. only use the columns to join `inner_join(df1, df2[c("date", "time_of_day")])` – akrun Jun 06 '19 at 15:54
I can't really seem to get the same output with my real data. The structure is the same. But in my df3 are all dates in double or triple version now. You have an idea why? – chiarasahara Jun 06 '19 at 16:04
@chiarasahara Please check if there are dupes in both datasets for the joining columns. In that case, it would make the program difficult to know which row to match resulting in duplicate matches – akrun Jun 06 '19 at 16:05
yes in the df2 are often more than one possibility for the same date unfortunately – chiarasahara Jun 06 '19 at 16:08
1

@chiarasahara In that case, you may need to take a decision on which row to match. You can use `distinct` to get the distinct rows where it keeps the first row and remove the duplicate, ie. `distinct(df2[c("date", "time_of_day")])` and then do the join. But, the decision depends on your problem, and the logic you want to implement – akrun Jun 06 '19 at 16:14
1

wow, now I get my result, thank you so much for your patience :) you made my day! – chiarasahara Jun 06 '19 at 16:19

Andrew · Answer 2 · 2019-06-06T17:41:58.823

Altering your base R code, you could do this (below). And you can wrap either in unique() if you want duplicate rows removed.

df1[paste0(df1$date, df1$time_of_day) %in% paste0(df2$date, df2$time_of_day), ]
        date time_of_day
1 2018-06-03     morning
4 2018-06-09     morning
5 2018-06-10   afternoon

and

df1[!paste0(df1$date, df1$time_of_day) %in% paste0(df2$date, df2$time_of_day), ]
        date time_of_day
2 2018-06-06   afternoon

Your attempts before did not work because df1$date %in% df2$date & df1$time_of_day %in% df2$time_of_day evaluates to TRUE TRUE TRUE TRUE. So it retained all rows. That is to say: all dates in df1 are in df2 & all times of day in df1 are in df2.

EDIT:

Alternatively, in dplyr you could use intersect and setdiff which work with dataframes and remove duplicates:

dplyr::intersect(df1, df2)
        date time_of_day
1 2018-06-03     morning
2 2018-06-09     morning
3 2018-06-10   afternoon

dplyr::setdiff(df1, df2)
        date time_of_day
1 2018-06-06   afternoon

R: Subset from two data frames based on multiple conditions

2 Answers2

data