0

I have a data frame in long format, that contains multiple entries per id. I also have a condition column that is either "app condition", "control condition" or NA. Each id has at least one "app condition" or "control condition" entry, but usually the rest are NAs. Now I need to filter out all id rows that belong to the app condition. So I need something like: if condition == "App condition" for id 5, remove all rows of id 5.

My df looks something like this:

ID Condition ....
A App condition
A NA
A NA
B Control condition
B NA
B Control condition
C NA
C App condition
D NA
D Control condition

And I want to keep all ID that have at least one "control condition" entry. So basically something like this:

ID Condition ....
B Control condition
B NA
B Control condition
D NA
D Control condition

My approach so far is using dplyr with

df <- df %>% 
   group_by(id) %>%
   filter(any(condition != "App condition")|is.na(condition))

But that also still returns IDs that belong to the app condition, but just removed these rows so that the NA rows of the same ID still stay in the data frame.

Can anybody help?

Thanks a lot!

Luca
  • 43
  • 4
  • Can you use `which` to create an index for when the condition is `TRUE` and then pass the index to `filter`? – edsandorf May 19 '21 at 08:10
  • Could you maybe add the code line for that? I am just starting with the more advanced R stuff and don't really have the hang on which and filter yet :) – Luca May 19 '21 at 17:55

3 Answers3

1

Using dplyr :

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(any(Condition %in% 'Control condition')) %>%
  ungroup

#  ID    Condition        
#  <chr> <chr>            
#1 B     Control condition
#2 B     NA               
#3 B     Control condition
#4 D     NA               
#5 D     Control condition

In base R :

subset(df, ID %in% ID[Condition %in%'Control condition'])
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • This works, Thanks! I think all other approaches now work too, as I found out that my id column had some white spaces included for the NA rows, and so R didn't match the subsequent ID. Thanks everybody! – Luca May 20 '21 at 06:02
0

data.table approach

sample data

library(data.table)
DT <- fread('ID,Condition
A,"App condition"
A,NA
A,NA
B,"Control condition"
B,NA
B,"Control condition"
C,NA
C,"App condition"
D,NA
D,"Control condition"')

code

DT[!ID %in% DT[ Condition == "App condition", ID], ]

output

#    ID         Condition
# 1:  B Control condition
# 2:  B              <NA>
# 3:  B Control condition
# 4:  D              <NA>
# 5:  D Control condition
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • Thanks for the response! The code works and I also replicated it on your sample df, but weirdly on my original df it still leaves the rows of ids in that are NA but belong to the app condition. But I think I could now try to again remove all id rows that only have NA values in the condition column. Edit: Nevermind, if I run the exact same code but this time to keep the ids with "Control condition", it removes all NA rows from the condition column, also the ones that I need to keep – Luca May 19 '21 at 17:47
  • 1
    Edit: I realized my ID rows had some white spaces, so R didn't match the right IDs. This works fine! Thanks! – Luca May 20 '21 at 06:02
0

I'd had done something like

df = as.data.frame(cbind(c('a','a','a','b','b','b','c','c','d','d'),c('AC',NA,NA,'CC',NA,'CC',NA,'AC',NA,'CC')))

df = df[-which(df$V1%in%df[which(df$V2=='AC'),'V1']),]

it looks like this afterward

   V1   V2
4   b   CC
5   b <NA>
6   b   CC
9   d <NA>
10  d   CC

In your case, just replace V1 by ID and V2 by condition :)

Hope it helped

elielink
  • 1,174
  • 1
  • 10
  • 22