
I have a data.table object that I'm piping through a few dplyr functions.
When it is passed through dplyr::select, the class of the resulting object is data.table + data.frame.
But when I pass it through dplyr::filter, the class of the output is data.frame only.

library(data.table)
library(magrittr)  # provides %>%

mtcars_dt <- data.table(mtcars)         # "data.table" "data.frame"

mtcars_dt %>% dplyr::select(hp, mpg) %>% class  # "data.table" "data.frame"
mtcars_dt %>% dplyr::filter(hp > 100) %>% class # "data.frame"

Why does this happen, and how do I ensure the data.table class is retained when using dplyr::filter?

Henrik
Ashrith Reddy
    If you are using data.table, why can't data.table methods be used i.e. `mtcars_dt[hp>100]` or in a pipe `mtcars_dt %>% .[hp>100]` – akrun Mar 19 '18 at 11:12
    Isn't @akrun suggestion "piping"? He just uses `data.table` syntax instead of `dplyr` functions. – pogibas Mar 19 '18 at 11:15
    You can chain with data.table too: `mtcars_dt[, .(hp, mpg)][hp > 100]` – Jaap Mar 19 '18 at 11:16
    First, I entirely agree with @akrun and Jaap here (use `data.table` methods!). Still, _if_ you absolutely want to use `dplyr` functions on a `data.table` and retain class `data.table`, you may use `dtplyr::tbl_dt` to "Create A Data Table Tbl". Disclaimer: I have not used the function, I have just noted that it existed. I don't know if you will run into trouble at some point having `class` `"tbl_dt" "tbl"`, in addition to `"data.table" "data.frame"`. – Henrik Mar 19 '18 at 12:55
    In addition, please note [the `dtplyr` notes](https://github.com/hadley/dtplyr): "dtplyr will always be a bit slower than data.table, because it creates copies of objects rather than mutating in place (that's the dplyr philosophy). Currently, dtplyr is quite a lot slower than bare data.table because the methods aren't quite smart enough." Thus, better to just stick to `data.table` methods. – Henrik Mar 19 '18 at 12:57
    @Henrik Maybe you should post an answer. Fwiw, after I load dtplyr, the OP's code gives the same class vector for both lines. – Frank Mar 19 '18 at 14:20

1 Answer


Originally, I thought it would be necessary to explicitly convert the data.table to a "data table tbl", using tbl_dt, to retain class data.table:

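library(data.table)
library(dtplyr)    # tbl_dt() is from the early dtplyr releases (0.0.x); later versions use lazy_dt()
library(magrittr)  # provides %>%

mtcars_dt <- data.table(mtcars)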

mtcars_dt %>% tbl_dt() %>% dplyr::select(hp, mpg) %>% class
# [1] "tbl_dt"     "tbl"        "data.table" "data.frame"

mtcars_dt %>% tbl_dt() %>% dplyr::filter(hp > 100) %>% class
# [1] "tbl_dt"     "tbl"        "data.table" "data.frame"

However, as pointed out by Frank in the comments, merely loading dtplyr is enough:

mtcars_dt %>% dplyr::select(hp, mpg) %>% class
# [1] "data.table" "data.frame"

mtcars_dt %>% dplyr::filter(hp > 100) %>% class
# [1] "data.table" "data.frame"

Weird. Or? I posted a dtplyr issue, so hopefully some dtplyr aficionados can shed some light on this.


The .data argument and the Value section are the same in ?filter and ?select, so from this information alone it's hard to tell why a .data of class data.table is treated differently by the two functions.
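If the class does get dropped, one workaround that does not depend on dtplyr is to convert the result back to a data.table afterwards. This is only a minimal sketch, assuming data.table, dplyr and magrittr are installed; the names `filtered` and `filtered_df` are illustrative and not from the question:

```r
library(data.table)
library(magrittr)  # provides %>%

mtcars_dt <- data.table(mtcars)

# Filter with dplyr, then convert the result back to a data.table.
# as.data.table() returns a data.table whether or not filter() kept the class.
filtered <- mtcars_dt %>%
  dplyr::filter(hp > 100) %>%
  as.data.table()
class(filtered)

# Alternative: setDT() converts the object in place instead of copying it.
filtered_df <- dplyr::filter(mtcars_dt, hp > 100)
setDT(filtered_df)
class(filtered_df)
```

as.data.table() makes a copy, whereas setDT() converts by reference, which mostly matters for large tables.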


After this little exercise, I would still argue that you should stick to data.table syntax. In particular, you can chain operations:

mtcars_dt[ , .(hp, mpg)][hp > 100]
# or
mtcars_dt[j = .(hp, mpg)][i = hp > 100]
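
As a quick check that the pure data.table route keeps the class, here is a small self-contained sketch. It assumes only data.table and magrittr are installed; the piped variant follows akrun's comment, and the names `chained` and `piped` are illustrative:

```r
library(data.table)
library(magrittr)  # only needed for the %>% variant

mtcars_dt <- data.table(mtcars)

# Chain two `[` calls: pick the columns first, then filter the rows
chained <- mtcars_dt[, .(hp, mpg)][hp > 100]
class(chained)

# The same row filter inside a magrittr pipe, using the `.` placeholder
piped <- mtcars_dt %>% .[hp > 100]
class(piped)
```

Both results go through data.table's own `[` method, so no dplyr method is involved at any point.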
Henrik