How to filter R datatable based on external column vector

Question

I'm interested to filter an R datatable to select row elements corresponding to a list of numerical columns.

As an example, suppose I have :

DT<- data.table(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9))

which gives,

Now, I have an external vector called select which contains columns I want selected corresponding to rows.

select <- c(2,3,1)

I would like it to return a new datatable that has each row value correspond to the selected column.

DTnew 
1: 4
2: 8
3: 3

If I try something like DT[,.SD[select]], it returns a new datatable with the entire rows corresponding to the select list.

> DT[,.SD[select]]
   a b c
1: 2 5 8
2: 3 6 9
3: 1 4 7

How can I accomplish this task?

edit: I didn't make it clear, but the result needs to retain the original order of the datatable rows, as it is a time series based object (I omitted the ts index to make the question simpler).

update 2: timing results of some posted solutions (using a datatable approach seems much faster on system time, not certain how to conclude results on user and elapsed time/overhead, but I also wanted to stay consistent with datatable the whole way. Guess I should have also asked whether it's common for DT users to go back and forth to matrix based computations when speed is priority).

library(data.table)
library(microbenchmark)

set.seed(123)

DT <- data.table(matrix(rnorm(10e3*10e3),nrow=10e3,ncol=10e3))
select<-sample(10e3,replace=FALSE)

op <- microbenchmark(
sol1 <- function(DT,select) DT[, V1 := .SD[[select]], by = select]$V1,

sol2 <- function(DT,select) {
x <- as.matrix(DT)
x[cbind(1:nrow(x), select)]
},

times=1000L)

Warning message:
In microbenchmark(sol1 <- function(DT, select) DT[, `:=`(V1, .SD[[select]]),  :
  Could not measure a positive execution time for 1019 evaluations.


> identical(sol1(DT,select),sol2(DT,select))
[1] TRUE
> op
Unit: nanoseconds
                                                                                    expr min lq   mean median uq   max neval cld
              sol1 <- function(DT, select) DT[, `:=`(V1, .SD[[select]]), by = select]$V1   0  0 25.136      0  1  9837  1000   a
 sol2 <- function(DT, select) {     x <- as.matrix(DT)     x[cbind(1:nrow(x), select)] }   0  0 52.477      0  1 39345  1000   a

method 2:

> system.time(replicate(10,sol1(DT,select)))
   user  system elapsed 
  64.07    0.25   64.33 
> system.time(replicate(10,sol2(DT,select)))
   user  system elapsed 
   4.97    2.25    7.22

If you're interested in a lot of rowwise operations, you should probably consider a matrix. You could do `DT[ , mycol := .SD[[select]], by=select]`, though. — Frank, Sep 17 '15 at 03:44
I suppose I could do DT[,mycol:=.SD[[select]],by=select][,mycol]. I'm dealing with much larger tables (maybe 10k X 100k) and speed is a priority, so I'm not sure if matrix operations lose performance there. — pat, Sep 17 '15 at 03:52
To be honest, I'm not sure either. For the matrix approach, I meant what Matthew has posted below (but actually using a matrix, not a data.frame). By the way, selecting mycol at the end would be most efficient like `$mycol` instead of `[, mycol]` since `[]` has some overhead. Also, you can tag code in comments using backticks (on the tilde key in US keyboards). — Frank, Sep 17 '15 at 03:53
`DT[,.SD[[select]],by=select]$V1` works well and might save time by not requiring the column assignment. I only need the filtered vector. — pat, Sep 17 '15 at 04:03
That works in this case, but the `by` will group things together, so if your `select` is `c(1,2,2,1,3)` your result will be grouped like `c(1,1,2,2,3)` instead of the original order.. not sure if I'm being clear. — Frank, Sep 17 '15 at 04:08
hmmm. order is important, as it is also an indexed time series. Does assigning the columns avoid that grouping? It should be in the same original indexed order. — pat, Sep 17 '15 at 04:13
Yeah, assigning to a column avoids that. Another (hard to read) alternative that preserves ordering is `ave( seq(nrow(DT)), select, FUN = function(ii) DT[ii][[ select[ii][1] ]] )` — Frank, Sep 17 '15 at 04:16
Here's a very similar question, maybe a dupe: http://stackoverflow.com/q/33310179/1191259 — Frank, Nov 06 '15 at 19:37

Matthew Lundberg · Answer 1 · 2015-09-17T16:07:35.337

3

You can do this with a matrix using matrix indexing:

x <- as.matrix(DT)
x[cbind(1:nrow(x), select)]
## [1] 4 8 3

If you started out with a data frame, you can also index that with a matrix:

x <- data.frame(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9)) # or as.data.frame(DT)
x[cbind(1:nrow(x), select)]
## [1] 4 8 3

edited Sep 17 '15 at 16:07

answered Sep 17 '15 at 03:53

Matthew Lundberg

42,009
6
90
112

1

if you're going to convert to `data.frame`, might as well do it in place using `setDF`, but I'm unconvinced that `data.*` is the right data structure to use here - I'd use regular matrices – eddi Sep 17 '15 at 15:05

score 1 · Accepted Answer · answered Sep 17 '15 at 12:50

A couple more options:

# extended example
DT       <- rbind(DT,DT)
select   <- c(select,rev(select))
expected <- c(4,8,3,1,8,6)

# create a new column with by
DT[, V1 := .SD[[select]], by = select]$V1

# or use ave
ave( seq(nrow(DT)), select, FUN = function(ii) DT[[ select[ii][1] ]][ii] )

These are both basically doing the same thing: for each value v in select, grab the corresponding vector, DT[[v]]; and subset it to where select==v.

How to filter R datatable based on external column vector

2 Answers2