9

I'm trying to get comfortable with using the Tidyverse, but data type conversions are proving to be a barrier. I understand that automatically converting strings to factors is not ideal, but sometimes I would like to use factors, so some approach to easily converting desired character columns in a tibble to factors would be excellent. I prefer to read in excel files with the readxl package, but factors aren't a permitted column type! I can go through column by column after the fact, but that's really not efficient. I want either of these two following things to work:

  1. Read in a file and simultaneously specify which columns should be read as factors:

     data <- read_excel(path = "myfile.xlsx", 
                        col_types=c(col2="factor", col5="factor)))
    
  2. Or this function would be excellent for many reasons, but I can't figure out how it's supposed to work. The col_types function is very confusing to me:

     diamonds <- col_types(diamonds, 
                           cols=c(cut="factor", color="factor", clarity="factor"))
    

Thanks in advance!

tef2128
  • 740
  • 1
  • 8
  • 19
  • Rather than try to force `readxl` to do something, you can use `dplyr` to just `mutate_if(data, is.character, as.factor)`. – Jake Kaupp Apr 19 '18 at 17:28
  • Thanks, that's helpful. Any thoughts on how to convert just the columns I want, rather than all of them? – tef2128 Apr 19 '18 at 17:30
  • You can use `mutate_at` to specify which names you would like to convert. If you really want to have the behaviour in a single function, you could make a wrapper to `read_excel` that coerces what columns you specify to factors. – Jake Kaupp Apr 19 '18 at 17:31
  • Brilliant! Thanks. This will do nicely. – tef2128 Apr 19 '18 at 17:34

1 Answers1

14

read_excel uses Excel cell types to guess column types for use in R. I also agree with the opinion of read_excel that one should read the data and allow a limited set of column types. Then if the user wishes, type conversion can take place later.

There is no function called col_types. That is a parameter name for read_excel. The tidyverse way would be:

library(tidyverse)

(foo <- data_frame(x = letters[1:3], y = LETTERS[4:6], z=1:3))
#> # A tibble: 3 x 3
#>   x     y         z
#>   <chr> <chr> <int>
#> 1 a     D         1
#> 2 b     E         2
#> 3 c     F         3

foo %>% 
  mutate_at(vars(x, y), factor)
#> # A tibble: 3 x 3
#>   x     y         z
#>   <fct> <fct> <int>
#> 1 a     D         1
#> 2 b     E         2
#> 3 c     F         3
ngm
  • 2,539
  • 8
  • 18
  • 1
    Thanks! This solution is perfect. But I have to argue only a little -- the col_types argument in read_excel allows conversions to most data types -- just not factors. Why not? Seems like a valuable addition to me ... – tef2128 Apr 19 '18 at 18:05
  • 3
    You'd have to ask Hadley! But if I would have to guess, I would take from https://cran.r-project.org/web/packages/readxl/vignettes/cell-and-column-types.html that `readxl` uses Excel cell types to guess variable types, and Excel just doesn't have anything like a factor. In principle I personally think that it's perhaps best for nothing to be a factor *until the moment you need it to be a factor*. Such as, to include something as a categorical variable in a model, or to make a barplot in `ggplot2`. – ngm Apr 19 '18 at 19:22