
I have a dataset with multiple samples (columns) and variables (rows). I want to filter the dataset to find variables that are unique to a particular set of samples.

This is the sample data frame:

dput(df)
structure(list(Description=c("k__Bacteria;__;__;__;__","k__Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__RB41;f__Ellin6075", 
"k__Bacteria;p__Acidobacteria;c__Acidobacteriia;o__Acidobacteriales;f__Koribacteraceae", 
"k__Bacteria;p__Acidobacteria;c__DA052;o__Ellin6513;f__", "k__Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__", 
"k__Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__", 
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__", 
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae", 
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinopolysporaceae", 
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Corynebacteriaceae"
), ADZU.3 = c(2651L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 12L), ADZU.4 = c(2439L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 5L), BEP.3 = c(11452L, 9L, 5L, 
0L, 0L, 6L, 14L, 0L, 0L, 83L), BEP.4 = c(4168L, 0L, 0L, 9L, 3L, 
0L, 0L, 5L, 6L, 61L), Hya.1 = c(15179L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 94L), Hya.2 = c(4525L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
34L)), row.names = c(NA, 10L), class = "data.frame")

I am using the filter_at() function in dplyr and have code that works as intended. My samples start with different letters (A, B, H, etc.), and I want to find variables that are unique to samples that start with the same letter (for example, the letter B).

The following code currently works well:

##code set 1, this code works

df.bep<-filter_at(df,vars(starts_with("A"),starts_with("H")), 
all_vars(.==0))

The result of this code is the following, which is what I expect to see:

dput(df.bep)
structure(list(Description = c("k__Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__RB41;f__Ellin6075", 
"k__Bacteria;p__Acidobacteria;c__Acidobacteriia;o__Acidobacteriales;f__Koribacteraceae", 
"k__Bacteria;p__Acidobacteria;c__DA052;o__Ellin6513;f__", "k__Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__", 
"k__Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__", 
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__", 
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae", 
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinopolysporaceae"
), ADZU.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), ADZU.4 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), BEP.3 = c(9L, 5L, 0L, 0L, 6L, 14L, 
0L, 0L), BEP.4 = c(0L, 0L, 9L, 3L, 0L, 0L, 5L, 6L), Hya.1 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), Hya.2 = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L)), row.names = c(NA, -8L), class = "data.frame")

The issue is that for longer datasets with many different samples, listing every letter for every sample I want to pass to filter_at() gets cumbersome to write.

I modified the script to use -starts_with() to try to filter the data frame by excluding samples that start with a specific letter I don't want to filter (for example filter all samples except those that start with letter B), such as:

df.bep.2<-filter_at(df,vars(-starts_with("B")),all_vars(.==0))

However, this second version doesn't work as intended. I do not get any errors, but I get an empty data frame:

dput(df.bep.2)
structure(list(Description = character(0), ADZU.3 = integer(0), 
ADZU.4 = integer(0), BEP.3 = integer(0), BEP.4 = integer(0), 
Hya.1 = integer(0), Hya.2 = integer(0)), row.names = c(NA, 
0L), class = "data.frame")

Is there something additional I need to put in the code when combining filter_at() and -starts_with()?

student001

1 Answer

An empty result means that no row meets your all_vars() condition across every selected column. vars(-starts_with("B")) selects all columns whose names do not start with "B", and the filter then keeps only the rows in which every one of those columns equals 0.
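One way to check which columns a negative selection picks up is to list them. Below is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, only shaped like the data in the question), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical miniature of the question's data: one text column, three count columns
toy <- data.frame(
  Description = c("taxon_a", "taxon_b"),
  ADZU.3 = c(5L, 0L),
  BEP.3  = c(9L, 4L),
  Hya.1  = c(2L, 0L),
  stringsAsFactors = FALSE
)

# Columns that a selection like -starts_with("B") keeps
selected <- names(select(toy, -starts_with("B")))
print(selected)

# Comparing a text column with 0 gives FALSE for every row
print(toy$Description == 0)
```

Any non-numeric column that survives the selection is tested against 0 as well, so it is worth looking at the selected names before filtering.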

For example, the mtcars dataset returns no rows with this condition:

mtcars %>%
  filter_at(vars(-starts_with("q")), all_vars(. == 0))

 [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
<0 rows> (or 0-length row.names)

Unless we add a row containing only 0's (the qsec column could hold a non-zero value, since it is excluded from the test):

mtcars %>%
  bind_rows(setNames(rep(0, ncol(.)), names(.))) %>%
  filter_at(vars(-starts_with("q")), all_vars(. == 0))

  mpg cyl disp hp drat wt qsec vs am gear carb
1   0   0    0  0    0  0    0  0  0    0    0

EDIT: for your specific problem, the filter returns nothing because the Description column is text and never == 0, so no row can pass all_vars(). There are probably several solutions, but here are two that should work for you!

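# Option 1: exclude the Description column from the test explicitly
df %>%
  filter_at(vars(-starts_with("B"), -one_of("Description")), all_vars(. == 0))

# Option 2: test only numeric columns whose names do not start with "B"
df %>%
  filter_if(sapply(., is.numeric) & !startsWith(names(.), "B"), all_vars(. == 0))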
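In more recent dplyr releases the scoped verbs such as filter_at() are superseded, and the same idea can be written with filter() and if_all(). The following is only a sketch, assuming dplyr 1.0.4 or later; the `toy` data frame and its values are made up for illustration:

```r
library(dplyr)

# Hypothetical miniature of the question's data
toy <- data.frame(
  Description = c("taxon_a", "taxon_b", "taxon_c"),
  ADZU.3 = c(5L, 0L, 0L),
  BEP.3  = c(9L, 4L, 0L),
  Hya.1  = c(2L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Keep rows where every numeric column not starting with "B" equals 0
b_only <- toy %>%
  filter(if_all(where(is.numeric) & !starts_with("B"), ~ .x == 0))

print(b_only)
```

Restricting the selection to numeric columns means text columns such as Description never enter the comparison. Rows where the "B" columns are also all 0 are still kept, because those columns are not tested.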
Andrew
  • Thank you for your answer Andrew. I have updated my question with sample data and intended output. As you can see, the first script does do what I intended, which is to find unique variables in samples that begin with a certain letter by filtering them out of the other samples in which they have a 0 value. Would you say that the two sets of code (the first having starts_with() and the second having -starts_with()) should essentially be performing the same process? I think they should be, but is it that they are actually doing fundamentally different things and I am mistaken? – student001 Aug 07 '19 at 21:48
  • @student001, just added an edit--hope it helps! Let me know if you have more questions! – Andrew Aug 08 '19 at 00:21
  • Thank you Andrew! The first notation worked (I didn't try the second notation but I'm sure it works as well). Can you please explain what the %>% does? I have never seen this notation but it seems to be used all the time with the filter() functions in dplyr – student001 Aug 08 '19 at 00:33
  • @student001, great question. It is called a pipe. There are pipes in several programming languages, but this one was introduced by the magrittr package (and it loads with dplyr). It can be thought of as saying "and then". As you noted, it is very common when using functions from dplyr / tidyverse. If you want to read more about it, there is a chapter in [R for Data Science](https://r4ds.had.co.nz/pipes.html). It is a great intro book (not just that chapter). – Andrew Aug 08 '19 at 12:41
  • Thank you for your help Andrew! – student001 Aug 08 '19 at 17:42
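
To illustrate the pipe discussed in the comments: `x %>% f(y)` passes `x` as the first argument of `f`, so it is another way of writing `f(x, y)`. A small sketch, assuming dplyr is installed; the `toy` data frame is hypothetical:

```r
library(dplyr)

# Hypothetical example data
toy <- data.frame(id = 1:4, count = c(0L, 3L, 0L, 7L))

# Piped form: "take toy, and then filter it"
piped <- toy %>% filter(count > 0)

# Equivalent nested form without the pipe
nested <- filter(toy, count > 0)

# Check whether both forms give the same result
print(identical(piped, nested))
```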