0

I have been trying for hours and I can't figure it out. I have a data frame with subjects and conditions df1, from which I want to exclude observations which have a certain value (less than 3 in the variable "value" from df2. I cannot make it work because I need to remove from df1, combinations of different levels of two variables.

This is df1:

df1 <- structure(list(subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,2L, 2L, 2L, 2L, 
                                  2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), 
                      condition = c("A", "A", "A", "B", "B", "B", "C", "C","C", "A", "A", 
                                    "A", "B", "B", "B", "C", "C", "C", "A", "A", "A","B", "B", "B", "C", "C", "C")), 
                 row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))

And this is df2

df2 <- structure(list(subject = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,4L, 4L, 4L, 5L, 5L, 5L), 
                      condition = c("A", "B", "C", "A", "B","C", "A", "B", "C", "A", "B", "C", "A", "B", "C"), 
                      value = c(10L, 8L, 7L, 3L, 8L, 5L, 3L, 3L, 9L, 8L, 7L, 8L, 10L, 6L, 2L)), 
                 row.names = c(NA,-15L), class = c("tbl_df", "tbl", "data.frame"))

And I want to remove in df1 all the combinations of subject and condition with a value under 3 so this would be the final df:

df3 <- structure(list(subject = c(2L, 3L, 3L, 5L), 
                      condition = c("A","A", "B", "C")), 
                 row.names = c(NA, -4L), 
                 class = c("tbl_df","tbl", "data.frame"))

So far I have been doing it like this, but I can't anymore because I have hundreds of rows...

df3 <- df1 %>% filter(!(subject==2 & condition=="A" |
                        subject==3 & (condition=="A" | condition=="B") |
                        subject==5 & condition=="C"))
Greg
  • 3,054
  • 6
  • 27
Lili
  • 47
  • 5
  • Hello, and thank you for the reproducible data. Your effort is especially appreciated in reformatting the text from `dput()`! Just to clarify: are you saying that your current solution is _inaccurate_, that it has poor _performance_ on hundreds of rows, or that it is _inextensible_ to more rows among hundreds? – Greg Jul 16 '21 at 14:43
  • Just FYI, I think you meant to say _"less than **[or equal to]** 3 in the variable `value` from `df2`"_. Only the inequality `<= 3` will give you the results you describe. – Greg Jul 16 '21 at 15:01
  • Hello. It has poor performance because it relies on me re-arranging df2 and then writing by hand the combinations of subject and condition at the top (so the ones less or equal to 3). – Lili Jul 16 '21 at 15:24

1 Answers1

2

Your sample result for df3 conflicts with the code you use to derive it, so here is a dplyr solution for each interpretation of what you want for df3.

Note: Both results are only possible when you

...exclude observations which have a certain value (less than [or equal to] 3 in the variable "value" from df2.

so I have implemented these solutions using the inequality <= 3 rather than < 3.

1st Interpretation of df3

To obtain the version of df3

# A tibble: 4 x 2
  subject condition
    <int> <chr>    
1       2 A        
2       3 A        
3       3 B        
4       5 C        

that you provide here as a sample result

And I want to remove in df1 all the combinations of subject and condition with a value under 3 so this would be the final df:

df3 <- structure(list(subject = c(2L, 3L, 3L, 5L), 
                      condition = c("A","A", "B", "C")), 
                 row.names = c(NA, -4L), 
                 class = c("tbl_df","tbl", "data.frame"))

simply use filter() on df2:

library(dplyr)


# ...
# Code to generate 'df1' and 'df2'.
# ...

df3 <- df2 %>% filter(value <= 3)

2nd Interpretation of df3

However, I it appears you actually desire the following version of df3

# A tibble: 18 x 2
   subject condition
     <int> <chr>    
 1       1 A        
 2       1 A        
 3       1 A        
 4       1 B        
 5       1 B        
 6       1 B        
 7       1 C        
 8       1 C        
 9       1 C        
10       2 B        
11       2 B        
12       2 B        
13       2 C        
14       2 C        
15       2 C        
16       3 C        
17       3 C        
18       3 C        

which you derive here:

df3 <- df1 %>% filter(!(subject==2 & condition=="A" |
                        subject==3 & (condition=="A" |condition=="B") |
                        subject==5 & condition=="C"))

In that case, you should anti_join() your df1 to a filter()ed version of df2:

library(dplyr)


# ...
# Code to generate 'df1' and 'df2'.
# ...


df3 <- df1 %>%
  anti_join(df2 %>% filter(value <= 3), by = c("subject", "condition"))
Greg
  • 3,054
  • 6
  • 27