How to select columns by name or their standard deviation simultaneously?

Question

Solution

I went with the solution provided by @thelatemail because I'm trying to stick with the tidyverse, and thus dplyr. I'm still new to R, so I'm taking baby steps and leaning on helper libraries. Thank you, everyone, for taking the time to contribute solutions.

df_new <- df_inh %>%
select(
  isolate,
  Phenotype,
  which(
    sapply( ., function( x ) sd( x ) != 0 )
  )
)
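For anyone who wants to try this without the original data, here is a minimal sketch on made-up data. The toy `df_inh` and its columns `g1` to `g3` are assumptions, not the real dataset. It uses `where()`, which dplyr 1.0.0 and later offer for predicate-based selection, and adds an `is.numeric()` guard (also an assumption about what is wanted) so that `sd()` is never called on text columns.

```r
library(dplyr)

# Hypothetical stand-in for the real data: two ID columns plus 0/1 markers.
df_inh <- tibble(
  isolate   = c("s1", "s2", "s3", "s4"),
  Phenotype = c("R", "S", "R", "S"),
  g1 = c(1, 1, 1, 1),   # constant, sd == 0
  g2 = c(0, 1, 0, 1),   # varies, sd != 0
  g3 = c(0, 0, 0, 0)    # constant, sd == 0
)

# Keep the two named columns plus every numeric column whose sd is non-zero.
df_new <- df_inh %>%
  select(
    isolate,
    Phenotype,
    where(~ is.numeric(.x) && sd(.x) != 0)
  )

names(df_new)
```

With this toy input only `isolate`, `Phenotype` and `g2` should remain.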

Question

I'm trying to select columns if the column name is "isolate" or "Phenotype" or if the standard deviation of the column values is not 0.

I have tried the following code.

df_new <- df_inh %>%
# remove isolate and Phenotype column for now, don't want to calculate their standard deviation
select(
  -isolate,
  -Phenotype
) %>%
# remove columns with all 1's or all 0's by calculating column standard deviation
select_if(
  function( col ) return( sd( col ) != 0 )
) %>%
# add back the isolate and Phenotype columns
select(
  isolate,
  Phenotype
)

I also tried this

df_new <- df_inh %>%
select_if(
  function( col ) {
  if ( col == 'isolate' | col == 'Phenotype' ) {
    return( TRUE )
  }
  else {
    return( sd( col ) != 0 )
  }
}
)

I can select columns by standard deviation or by column name; however, I cannot do both simultaneously.
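A likely reason the second attempt fails is that a column-wise predicate is handed each column's values, not its name, so `col == 'isolate'` compares data values against that string. The base R sketch below illustrates this on a made-up two-column data frame (`dat` and `g1` are hypothetical) and then combines a name test with a value test by working on `names()` directly.

```r
# Hypothetical toy data: one text column and one constant numeric column.
dat <- data.frame(
  isolate = c("s1", "s2"),
  g1 = c(1, 1),
  stringsAsFactors = FALSE
)

# Mimic a column-wise predicate: the function receives the column's values.
seen <- lapply(dat, function(col) col)
seen$isolate                 # the values of the column, not the string "isolate"
seen$isolate == "isolate"    # element-wise comparison against the values

# Test names and values separately, then combine the two logical vectors.
by_name  <- names(dat) %in% c("isolate", "Phenotype")
by_value <- vapply(dat, function(col) is.numeric(col) && sd(col) != 0, logical(1))
keep <- by_name | by_value
names(dat)[keep]
```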

Please make your example reproducible. You need to include at least a sample of the data by running the `dput` command and adding its output to your question — morgan121, Apr 09 '19 at 03:41
Base R isn't too terrible for this - `dat[names(dat) %in% c("isolate","Phenotype") | sapply(dat, sd) != 0]` or the same logic in `dplyr` I suppose works too - `dat %>% select(isolate, Phenotype, which(sapply(., function(x) sd(x) != 0)))` — thelatemail, Apr 09 '19 at 03:43
@thelatemail what does the "." argument in the sapply() function denote, by the way? — Spencer A Lank, Apr 09 '19 at 05:50
@spence - `.` just represents the whole object passed in by `%>%`, in this case just the dataset `dat` — thelatemail, Apr 09 '19 at 05:57
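To make the role of `.` concrete, here is a small sketch using `mtcars` (chosen only for illustration, with the `sum(x) > 1000` test as an arbitrary stand-in for the `sd` test). The piped call and the call with the data frame written out twice are expected to give the same result.

```r
library(dplyr)

# Piped form: `.` inside the nested sapply() call refers to mtcars.
piped <- mtcars %>%
  select(mpg, which(sapply(., function(x) sum(x) > 1000)))

# The same call without the pipe, naming the data frame explicitly both times.
plain <- select(mtcars, mpg, which(sapply(mtcars, function(x) sum(x) > 1000)))

identical(piped, plain)
```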

score 4 · Answer 1 · answered Apr 09 '19 at 03:46

I'm not sure you can do this with select_if alone, but one way is to combine two select operations and then bind the columns. Using mtcars as sample data:

library(dplyr)
bind_cols(mtcars %>% select_if(function(x) sum(x) > 1000), 
          mtcars %>% select(mpg, cyl))

#    disp  hp  mpg cyl
#1  160.0 110 21.0   6
#2  160.0 110 21.0   6
#3  108.0  93 22.8   4
#4  258.0 110 21.4   6
#5  360.0 175 18.7   8
#6  225.0 105 18.1   6
#7  360.0 245 14.3   8
#8  146.7  62 24.4   4
#....

However, if a column satisfies both conditions (it is picked by select_if as well as by select), then the column would be repeated.

We can also use base R, which selects the same columns (with the named columns first) and uses unique to avoid selecting a column twice.

sel_names <- c("mpg", "cyl")
mtcars[unique(c(sel_names, names(mtcars)[sapply(mtcars, sum) > 1000]))]

So for your case the two versions would be:

bind_cols(df_inh %>% select_if(function(x) sd(x) != 0), 
          df_inh %>% select(isolate, Phenotype))

and

sel_names <- c("isolate", "Phenotype")
df_inh[unique(c(sel_names, names(df_inh)[sapply(df_inh, sd) != 0]))]
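One caveat, offered as a hedged sketch rather than a correction: if `isolate` or `Phenotype` hold text or factors, `sd()` may return `NA` or fail on them, and an `NA` in the name vector would break the indexing. The variant below guards with `is.numeric()`. The toy `df_inh` and its columns `g1` and `g2` are made up for illustration.

```r
# Hypothetical toy data with a text ID column.
df_inh <- data.frame(
  isolate   = c("a", "b", "c"),
  Phenotype = c(1, 0, 1),
  g1 = c(1, 1, 1),   # constant, sd == 0
  g2 = c(0, 1, 0),   # varies, sd != 0
  stringsAsFactors = FALSE
)

sel_names <- c("isolate", "Phenotype")

# TRUE only for numeric columns with non-zero sd; never NA.
varies <- vapply(df_inh, function(x) is.numeric(x) && sd(x) != 0, logical(1))

# unique() keeps Phenotype from being selected twice.
df_new <- df_inh[unique(c(sel_names, names(df_inh)[varies]))]
names(df_new)
```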

score 3 · Answer 2 · Dij · 2019-04-09T14:53:36.627

I wouldn't use tidyverse functions at all for this task.

# Column indices: the two named columns, then every column with non-zero sd
df_new <- df_inh[,c(grep("isolate", names(df_inh)), 
                    grep("Phenotype", names(df_inh)), 
                    which(sapply(df_inh, sd) != 0))]

Above, you just index with [], building the column indices for each criterion with grep and which.
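Two things to watch with this indexing approach, shown as a sketch on made-up data (`dat` and the `isolate_batch` column are hypothetical): `grep` matches partial names, and a column that meets both criteria is indexed twice. Exact matching with `match()` plus `unique()` is one way to avoid both.

```r
# Hypothetical toy data; isolate_batch merely contains the text "isolate".
dat <- data.frame(
  isolate = c(1, 2, 3),
  isolate_batch = c(5, 5, 5),
  Phenotype = c(0, 1, 0),
  g1 = c(1, 1, 1)
)

grep("isolate", names(dat))     # partial match: hits both isolate columns
grep("^isolate$", names(dat))   # anchored pattern: hits only the exact name

# Exact name lookup plus the sd test, with repeated indices removed.
idx <- unique(c(
  match(c("isolate", "Phenotype"), names(dat)),
  which(sapply(dat, sd) != 0)
))
dat_new <- dat[, idx, drop = FALSE]
names(dat_new)
```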

edited Apr 09 '19 at 14:53

answered Apr 09 '19 at 03:45

