I'm interested to filter an R datatable to select row elements corresponding to a list of numerical columns.
As an example, suppose I have :
DT<- data.table(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9))
which gives,
a b c
1: 1 4 7
2: 2 5 8
3: 3 6 9
Now, I have an external vector called select which contains columns I want selected corresponding to rows.
select <- c(2,3,1)
I would like it to return a new datatable that has each row value correspond to the selected column.
DTnew
1: 4
2: 8
3: 3
If I try something like DT[,.SD[select]]
, it returns a new datatable with the entire rows corresponding to the select list.
> DT[,.SD[select]]
a b c
1: 2 5 8
2: 3 6 9
3: 1 4 7
How can I accomplish this task?
edit: I didn't make it clear, but the result needs to retain the original order of the datatable rows, as it is a time series based object (I omitted the ts index to make the question simpler).
update 2: timing results of some posted solutions (using a datatable approach seems much faster on system time, not certain how to conclude results on user and elapsed time/overhead, but I also wanted to stay consistent with datatable the whole way. Guess I should have also asked whether it's common for DT users to go back and forth to matrix based computations when speed is priority).
library(data.table)
library(microbenchmark)
set.seed(123)
DT <- data.table(matrix(rnorm(10e3*10e3),nrow=10e3,ncol=10e3))
select<-sample(10e3,replace=FALSE)
op <- microbenchmark(
sol1 <- function(DT,select) DT[, V1 := .SD[[select]], by = select]$V1,
sol2 <- function(DT,select) {
x <- as.matrix(DT)
x[cbind(1:nrow(x), select)]
},
times=1000L)
Warning message:
In microbenchmark(sol1 <- function(DT, select) DT[, `:=`(V1, .SD[[select]]), :
Could not measure a positive execution time for 1019 evaluations.
> identical(sol1(DT,select),sol2(DT,select))
[1] TRUE
> op
Unit: nanoseconds
expr min lq mean median uq max neval cld
sol1 <- function(DT, select) DT[, `:=`(V1, .SD[[select]]), by = select]$V1 0 0 25.136 0 1 9837 1000 a
sol2 <- function(DT, select) { x <- as.matrix(DT) x[cbind(1:nrow(x), select)] } 0 0 52.477 0 1 39345 1000 a
method 2:
> system.time(replicate(10,sol1(DT,select)))
user system elapsed
64.07 0.25 64.33
> system.time(replicate(10,sol2(DT,select)))
user system elapsed
4.97 2.25 7.22