
For those who are interested, I opened an issue on GitHub.


Consider the following two examples:

> library(data.table)
> iris <- as.data.table(iris)

> # option 1
> iris[, c('Species', paste0(c('Sepal.', 'Petal.'), 'Length'))]
       Species Sepal.Length Petal.Length
  1:    setosa          5.1          1.4
  2:    setosa          4.9          1.4
  3:    setosa          4.7          1.3
  4:    setosa          4.6          1.5
  5:    setosa          5.0          1.4
 ---                                    
146: virginica          6.7          5.2
147: virginica          6.3          5.0
148: virginica          6.5          5.2
149: virginica          6.2          5.4
150: virginica          5.9          5.1

> # option 2
> iris[, c('Species', grep('Length', names(iris), value = TRUE))] 
[1] "Species"      "Sepal.Length" "Petal.Length"

The `j` expressions in option 1 and option 2 are similar, but the results are different. I know I can get the column selection in the following way:

> # option 3
> x <- grep('Length', names(iris), value = TRUE)
> iris[, c('Species', ..x)]
       Species Sepal.Length Petal.Length
  1:    setosa          5.1          1.4
  2:    setosa          4.9          1.4
  3:    setosa          4.7          1.3
  4:    setosa          4.6          1.5
  5:    setosa          5.0          1.4
 ---                                    
146: virginica          6.7          5.2
147: virginica          6.3          5.0
148: virginica          6.5          5.2
149: virginica          6.2          5.4
150: virginica          5.9          5.1

However, I wonder why option 1 results in column selection while option 2 evaluates to a character vector.

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.2

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1   
mt1022
  • In option 2, if `names(iris)` is replaced with `names(..iris)`, it gives the same result as option 1, so it seems that an expression containing only constants and external names is treated differently from other expressions. The top-level function, in this case `c`, also seems to make a difference: if we replace `c(...)` with `(c(...))` in option 1, it gives the same output as option 2. – G. Grothendieck Oct 28 '19 at 12:53
  • @G.Grothendieck, interesting observation. The newest data.table `NEWS.md` says that "When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers.", so I thought `..` should only work with a vector. In the case of this post, the value of `..x` is a data.table, not column names or indices. This is kind of confusing. – mt1022 Oct 28 '19 at 13:15
  • I guess this is another attempt by the data.table devs to make data.table more user friendly. The source seems to be [here](https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L221), where some functions such as `paste` and `c` are explicitly searched for in the j expression and `with=FALSE` is then set, similar to what happens with the `..` prefix. `grep` isn't searched for, presumably on the assumption that it can be used for things other than column selection. – David Arenburg Oct 28 '19 at 13:26
  • @DavidArenburg, thanks for the information. Very helpful. It seems that this feature is still immature. I think using the `(c(...))` syntax as mentioned in Grothendieck's comment is more natural than explicitly searching for certain functions. – mt1022 Oct 28 '19 at 13:33
  • Related: [Select a sequence of columns: `:` works but not `seq`](https://stackoverflow.com/questions/41775462/select-a-sequence-of-columns-works-but-not-seq); creating column indices with `:` vs. `seq` (`:` works; `root == ":"`). The case of `grep` is also briefly mentioned. – Henrik Oct 28 '19 at 14:01
  • Somewhat related: [Different results when subsetting data.table columns with numeric indices in different ways](https://stackoverflow.com/questions/50314124/different-results-when-subsetting-data-table-columns-with-numeric-indices-in-dif) (similar underlying issue). – Henrik Oct 28 '19 at 14:05
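
The comments above suggest that data.table decides between column selection and ordinary evaluation by inspecting the `j` expression. A minimal sketch of ways to request column selection explicitly, so the result does not depend on that inspection, assuming data.table 1.12.2 as in the session info. The names `dt`, `cols`, `res_with`, `res_dots` and `res_sd` are illustrative and not from the original post:

```r
library(data.table)

# Work on a copy so the built-in iris data set is left untouched
dt <- as.data.table(iris)

# Build the column names outside of `[`, as in option 3
cols <- c("Species", grep("Length", names(dt), value = TRUE))

# 1. Ask for column selection explicitly with with = FALSE
res_with <- dt[, cols, with = FALSE]

# 2. Use the .. prefix to look `cols` up in the calling scope
res_dots <- dt[, ..cols]

# 3. Use .SD together with .SDcols
res_sd <- dt[, .SD, .SDcols = cols]

print(names(res_with))
```

All three forms should return a data.table with the columns `Species`, `Sepal.Length` and `Petal.Length`, whichever function was used to build `cols`.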

0 Answers