49

I have a data frame that may or may not have some particular columns present. I want to select columns using dplyr if they do exist and, if not, just ignore that I tried to select them. Here's an example:

# Load libraries
library(dplyr)

# Create data frame
df <- data.frame(year = 2000:2010, foo = 0:10, bar = 10:20)

# Pull out some columns
df %>% select(year, contains("bar"))

# Result
#    year bar
# 1  2000  10
# 2  2001  11
# 3  2002  12
# 4  2003  13
# 5  2004  14
# 6  2005  15
# 7  2006  16
# 8  2007  17
# 9  2008  18
# 10 2009  19
# 11 2010  20

# Try again for non-existent column
df %>% select(year, contains("boo"))

# Result
# data frame with 0 columns and 11 rows

In the latter case, I just want to return a data frame with the column year since the column boo doesn't exist. My question is why do I get an empty data frame in the latter case and what is a good way of avoiding this and achieving the desired result?

EDIT: Session info

R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.5.0

loaded via a namespace (and not attached):
[1] lazyeval_0.2.0   magrittr_1.5     R6_2.2.0         assertthat_0.2.0 DBI_0.6-1        tools_3.3.3     
[7] tibble_1.3.0     Rcpp_0.12.10    
Dan
  • I can't reproduce this bug with my version of dplyr. Could you share the result of `sessionInfo()`? Also, could you try installing the development version of dplyr from GitHub (`devtools::install_github("tidyverse/dplyr")`) and see if that fixes it? – David Robinson May 04 '17 at 15:23
  • I've added the session info to the original question. – Dan May 04 '17 at 15:27
  • It gives correct output with dplyr ‘0.5.0.9001’. – mt1022 May 04 '17 at 15:28
  • This was a bug [reported here](https://github.com/tidyverse/dplyr/issues/1834) and fixed in February 2017. dplyr 0.6.0 is expected to come out soon on CRAN. – David Robinson May 04 '17 at 15:30
  • Aha! Thanks for the help. – Dan May 04 '17 at 15:32

3 Answers

47

You can use any_of() (from the tidyselect package):

df %>% select(any_of(c("year", "boo")))
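
As a minimal sketch of how this behaves on the question's data frame, assuming dplyr 1.0.0 or later (where any_of() is available), the following shows that a missing name is skipped without a warning and that the output follows the order of the character vector. The names only_year, wanted, and reordered are illustrative, not from the original post.

```r
library(dplyr)

# Same data frame as in the question
df <- data.frame(year = 2000:2010, foo = 0:10, bar = 10:20)

# "boo" does not exist, so any_of() skips it without a warning
only_year <- df %>% select(any_of(c("year", "boo")))
names(only_year)

# Columns come back in the order given in the vector, not in data frame order
wanted <- c("bar", "boo", "year")
reordered <- df %>% select(any_of(wanted))
names(reordered)
```

If a missing column should be treated as an error instead, all_of() takes the same kind of character vector but fails when any name is absent.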
David Rubinger
  • I prefer this method because `select(any_of(c()))` respects the order that column names are listed in, and organizes them accordingly in the output. – Pake Dec 01 '21 at 18:06
  • hint: don't forget the '' around the column names inside any_of(c()). – Sapiens Jul 13 '23 at 15:30
42

In the development version of dplyr,

df %>%
   select(year, contains("boo"))
#     year
#1  2000
#2  2001
#3  2002
#4  2003
#5  2004
#6  2005
#7  2006
#8  2007
#9  2008
#10 2009
#11 2010

gives the expected output.

Otherwise, one option would be to use one_of():

df %>%
   select(one_of("year", "boo"))

It issues a warning message if a column is not available.

Another option is matches():

df %>%
  select(matches("year|boo"))
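
Because matches() takes a regular expression, an unanchored pattern also selects any column whose name merely contains "year" or "boo". The sketch below, using a hypothetical data frame df2 that is not from the question, shows one way to anchor the pattern so that only exact names are selected.

```r
library(dplyr)

# Hypothetical data frame with similarly named columns (not from the question)
df2 <- data.frame(year = 2000:2002, year1 = 1:3, yearly_total = 4:6)

# Unanchored: every column whose name contains "year" or "boo" is kept
loose <- df2 %>% select(matches("year|boo"))
names(loose)

# Anchored with ^ and $: only columns named exactly "year" or "boo" are kept
exact <- df2 %>% select(matches("^(year|boo)$"))
names(exact)
```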
akrun
  • I don't think this answers the question, which is about an apparent bug in dplyr. `select(year, contains("boo"))` should include `year` in the output. – David Robinson May 04 '17 at 15:22
  • @Patronus I showed a way to get the expected output. His question is ` I just want to return a data frame with the column year since the column boo doesn't exist`. – akrun May 04 '17 at 15:23
  • Also works with "-". `%>% select(-one_of("not_wanted_variable"))` will remove `not_wanted_variable` from your data.frame – Lionel Trebuchon May 09 '19 at 05:51
  • Note that 'matches' returns any column that contains "year" or "boo", so if you have column names "year1", "year2", etc, those will all be returned. – Sylvia Rodriguez Jul 18 '21 at 00:11
  • @SylviaRodriguez this is an old post. You can modify those with `matches("^(year\\d+$)|boo")` – akrun Jul 18 '21 at 00:12
7

Here's a slight twist using dplyr::select_if() that will not throw an Unknown columns: warning if you try to select a column name that does not exist, in this case 'bad_column':

df %>% 
  select_if(names(.) %in% c('year', 'bar', 'bad_column'))
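
The same membership test can be sketched in base R, which may be handy when dplyr is not loaded. This is an illustrative equivalent rather than part of the original answer; wanted, keep, and subset_df are made-up names. Note that a logical mask keeps the data frame's own column order, not the order of the requested names.

```r
# Same data frame as in the question
df <- data.frame(year = 2000:2010, foo = 0:10, bar = 10:20)

# Requested names, including one that does not exist
wanted <- c("bar", "year", "bad_column")

# Logical mask over the existing columns; unknown names simply never match
keep <- names(df) %in% wanted

# drop = FALSE keeps the result a data frame even if only one column matches
subset_df <- df[, keep, drop = FALSE]
names(subset_df)
```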
sbha