0

I'm using read_excel for speed and simplicity to import an Excel file.

Unfortunately, there's as yet no capability of excluding selected columns which won't be needed from the data set; to save effort, I am naming such columns "x" with the col_names argument, which is easier than trying to keep track of x1, x2, and so on.

I'd then like to exclude such columns on the fly if possible to avoid an extra step of copying, so in pseudocode:

read_excel("data.xlsx", col_names = c("x", "keep", "x"))[ , !"x"]

We can use the sample data set included with the readxl package for illustration:

library(readxl)
DF <- read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                 col_names = c("x", "x", "length", "width", "x"), skip = 1L)

The approaches I've seen that work don't exactly work on the fly, e.g., having stored DF, we can now do:

DF <- DF[ , -grep("^x$", names(DF))]

This works but requires making a copy of DF by storing it, then overwriting it; I'd prefer to remove the columns in the same command as read_excel to allocate DF properly ab initio.

Other similar approaches require declaring temporary variables, which I prefer to avoid if possible, e.g.,

col_names <- c("x", "x", "length", "width", "x")
DF <- read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                 col_names = col_names, skip = 1L)[ , -grep("^x$", col_names)]

Is there a way to axe these columns without creating unnecessary temporary variables?

(I could convert to data.table, but am wondering if there's a way to do so without data.table)

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • 1
    Are you open to using `dplyr`? You can pipe that into `select()` without explicitly creating a variable first. (Of course, the complete data.frame will still exist whether or not you name it.) If the only problem is the extra variable, just create your own function `dropXcolumns()` or something. – MrFlick Jul 15 '16 at 20:17
  • 2
    About `requires making a copy of DF by storing it`: I guess that a copy is made even if you make it a one-liner, so I think yours is a no problem. The only way to avoid a copy would be if `read_excel` allowed it explicitly (something similar to the `select` argument of `data.table::fread`). – nicola Jul 15 '16 at 20:22
  • 1
    `XLConnnect::readWorksheet` (which has the better documentation, though you likely want `readWorksheetFromFile`) has `keep` and `drop` parameters for columns. It's Java-based, though, so installation is a little more complicated than usual. – alistaire Jul 15 '16 at 21:08
  • @alistaire `xlsx` is the same. I have my reasons for using `read_excel` here. – MichaelChirico Jul 15 '16 at 21:56

2 Answers2

2

I don't see an easy way to avoid copying. But a one liner is achievable using piping, needing no temporary variables. E.g.:

library(magrittr)
read_excel(system.file("extdata/datasets.xlsx", package = "readxl"), 
           col_names = c("x", "x", "length", "width", "x"), skip = 1L) %>% 
  extract(, -grep("^x$", names(.))) -> 
  DF
Axeman
  • 32,068
  • 8
  • 81
  • 94
2

There actually is a way to do this in readxl::read_excel, though it's a little hidden, and I have no idea if the columns are read into memory [temporarily] regardless. The trick is to specify column types, putting in "blank" for those you don't want:

readxl::read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                   col_types = c('blank', 'blank', 'numeric', 'numeric', 'text'))
## # A tibble: 150 x 3
##    Petal.Length Petal.Width Species
##           <dbl>       <dbl>   <chr>
## 1           1.4         0.2  setosa
## 2           1.4         0.2  setosa
## 3           1.3         0.2  setosa
## 4           1.5         0.2  setosa
## 5           1.4         0.2  setosa
## 6           1.7         0.4  setosa
## 7           1.4         0.3  setosa
## 8           1.5         0.2  setosa
## 9           1.4         0.2  setosa
## 10          1.5         0.1  setosa
## # ... with 140 more rows

The caveat is that you need to know all the data types of the columns you want, though I suppose you could always just start with text and clean up later with type.convert or whatnot.

alistaire
  • 42,459
  • 4
  • 77
  • 117
  • Interesting. I've filed [#193](https://github.com/hadley/readxl/issues/193) with `readxl` as this is indeed quite poorly documented. – MichaelChirico Jul 15 '16 at 22:17
  • 1
    Anyway, if I understand the C++ code correctly, it's not reading anything for the `"blank"` columns, as happens (?) [here](https://github.com/hadley/readxl/blob/7c7f66b29764eee91385c9f1fc755b1ac6c10e08/src/CellType.h#L132-L141). Good work! – MichaelChirico Jul 15 '16 at 22:27