How to select specific observations from columns based on partial string match of column names

Question

My dataset has a large number of columns starting with "dis_...".

The values in the columns are either 0 (without disease) or 1 (with disease). I would like to create a dataframe of observations with 1 for a specific disease and 0 for everything else.

I have tried the following:

istroke <- filter(onlyCRP, dis_ep0009 == 1 & grep("dis_" == 0))

and in combination with select:

istroke1 <- filter(onlyCRP, dis_ep0009 == 1 & select(contains("dis_") == 0))

As you'd guess, neither of them works.

I have looked at these posts:

filtering columns by regex in dataframe

Subset data based on partial match of column names

But they don't answer my question.

Please let me know if you require further clarifications.

Edit: I realized I needed to clarify further what I wanted. Consider this table:

dis_ep0009  dis_epxxx   dis_epxxx
 0            0             0
 0            1             0  
 0            0             1
 1            0             1
 0            0             0
 0            0             0
 1            1             1

I need another column, e.g. IS, coded according to conditions on these 3 columns (I actually have 29 of these "dis_" columns):

If dis_ep0009 == 1, then IS == 1 (regardless of 0 or 1 on any other "dis_" columns).
If dis_ep0009 == 0 and any other "dis_" column == 1, I want to drop these observations.
If dis_ep0009 == 0 and all other "dis_" columns == 0, I want to code IS == 0.

So the resulting table should look like this:

dis_ep0009  dis_epxxx   dis_epxxx    IS
 0            0             0        0
 0            1             0        drop
 0            0             1        drop
 1            0             1        1
 0            0             0        0
 0            0             0        0
 1            1             1        1

I have tried pairing filter (dplyr) with grep and ifelse statements but can't make heads or tails of it. In essence, it should be something simple like this (not meant to work):

istroke <- filter(df, ifelse(dis_ep0009 == 1, 1, ifelse(dis_ep0009 == 0 & grep("dis_", names(df)) == 0, 0, ifelse(dis_ep0009 == 0 & grep("dis_", names(df)) == 1, drop())))

Thanks in advance!

moodymudskipper · Answer 1 · 2017-06-06T15:54:13.413


See comments in code, and tell me if that's what you want

specific_disease <- "dis_ep0009"
disease_cols <- grep("^dis_", names(onlyCRP), value = TRUE) # all columns whose names start with "dis_"
disease_cols <- setdiff(disease_cols, specific_disease) # all these columns except your specific disease
onlyCRP$any_other_disease <- apply(onlyCRP[, disease_cols, drop = FALSE] == 1, 1, any) # a Boolean column saying if there is another disease besides the possible specific one
onlyCRP[onlyCRP[[specific_disease]] == 1 & !onlyCRP$any_other_disease, ] # the subset where you'll have only your specific disease and no other; [[ ]] looks the column up by the name stored in specific_disease
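The three rules from the edit to the question (code IS, dropping rows that have only other diseases) might be sketched as follows. This is only an illustration under assumptions: the toy data frame df mirrors the table in the question, and the column names dis_ep0001 and dis_ep0002 are placeholders I made up for the other "dis_" columns.

```r
library(dplyr)

# Hypothetical toy data mirroring the table in the question; dis_ep0001 and
# dis_ep0002 are placeholder names for the other "dis_" columns.
df <- data.frame(
  dis_ep0009 = c(0, 0, 0, 1, 0, 0, 1),
  dis_ep0001 = c(0, 1, 0, 0, 0, 0, 1),
  dis_ep0002 = c(0, 0, 1, 1, 0, 0, 1)
)

# All "dis_" columns except the disease of interest
other_cols <- setdiff(grep("^dis_", names(df), value = TRUE), "dis_ep0009")

# TRUE for rows that have at least one of the other diseases
has_other <- rowSums(df[other_cols] == 1) > 0

result <- df %>%
  # keep cases, plus rows with no disease at all; everything else is dropped
  filter(dis_ep0009 == 1 | !has_other) %>%
  # the remaining rows are either cases (1) or disease-free controls (0)
  mutate(IS = as.integer(dis_ep0009 == 1))

print(result)
```

With the real data, replacing df by onlyCRP should be enough, since other_cols is built from the column names and does not depend on how many "dis_" columns there are.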

edited Jun 06 '17 at 15:54

answered Jun 06 '17 at 15:46

moodymudskipper


I guess I need to clarify further: I would like all observations that are coded 1 for dis_ep0009 and 0 in all other "dis_" columns. The extra boolean column does not appear to serve that purpose. Nonetheless, I created a df with the sequence and it had 0 observations. I would also appreciate something simpler, preferably using dplyr-based code. Thanks. – Mak Jun 06 '17 at 16:14
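A dplyr-only version of that filter (1 for dis_ep0009, 0 in every other "dis_" column) could look roughly like the sketch below. It assumes a reasonably recent dplyr (if_all() was added in 1.0.4), and the toy data and the column names dis_ep0001 and dis_ep0002 are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; only row 5 has dis_ep0009 == 1 and no other disease.
df <- data.frame(
  dis_ep0009 = c(0, 0, 0, 1, 1, 1),
  dis_ep0001 = c(0, 1, 0, 0, 0, 1),
  dis_ep0002 = c(0, 0, 1, 1, 0, 1)
)

istroke <- df %>%
  filter(
    # the specific disease must be present
    dis_ep0009 == 1,
    # every other column whose name starts with "dis_" must be 0
    if_all(starts_with("dis_") & !dis_ep0009, ~ .x == 0)
  )

print(istroke)
```

The selection starts_with("dis_") & !dis_ep0009 picks the "dis_" columns by partial name match and excludes the disease of interest, so none of the other columns has to be listed by hand.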
