4

I'm interested to filter an R datatable to select row elements corresponding to a list of numerical columns.

As an example, suppose I have :

DT<- data.table(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9))

which gives,

   a b c
1: 1 4 7
2: 2 5 8
3: 3 6 9

Now, I have an external vector called select which contains columns I want selected corresponding to rows.

select <- c(2,3,1)

I would like it to return a new datatable that has each row value correspond to the selected column.

DTnew 
1: 4
2: 8
3: 3

If I try something like DT[,.SD[select]], it returns a new datatable with the entire rows corresponding to the select list.

> DT[,.SD[select]]
   a b c
1: 2 5 8
2: 3 6 9
3: 1 4 7

How can I accomplish this task?

edit: I didn't make it clear, but the result needs to retain the original order of the datatable rows, as it is a time series based object (I omitted the ts index to make the question simpler).

update 2: timing results of some posted solutions (using a datatable approach seems much faster on system time, not certain how to conclude results on user and elapsed time/overhead, but I also wanted to stay consistent with datatable the whole way. Guess I should have also asked whether it's common for DT users to go back and forth to matrix based computations when speed is priority).

library(data.table)
library(microbenchmark)

set.seed(123)

DT <- data.table(matrix(rnorm(10e3*10e3),nrow=10e3,ncol=10e3))
select<-sample(10e3,replace=FALSE)

op <- microbenchmark(
sol1 <- function(DT,select) DT[, V1 := .SD[[select]], by = select]$V1,

sol2 <- function(DT,select) {
x <- as.matrix(DT)
x[cbind(1:nrow(x), select)]
},

times=1000L)

Warning message:
In microbenchmark(sol1 <- function(DT, select) DT[, `:=`(V1, .SD[[select]]),  :
  Could not measure a positive execution time for 1019 evaluations.


> identical(sol1(DT,select),sol2(DT,select))
[1] TRUE
> op
Unit: nanoseconds
                                                                                    expr min lq   mean median uq   max neval cld
              sol1 <- function(DT, select) DT[, `:=`(V1, .SD[[select]]), by = select]$V1   0  0 25.136      0  1  9837  1000   a
 sol2 <- function(DT, select) {     x <- as.matrix(DT)     x[cbind(1:nrow(x), select)] }   0  0 52.477      0  1 39345  1000   a

method 2:

> system.time(replicate(10,sol1(DT,select)))
   user  system elapsed 
  64.07    0.25   64.33 
> system.time(replicate(10,sol2(DT,select)))
   user  system elapsed 
   4.97    2.25    7.22 
pat
  • 617
  • 8
  • 17
  • 2
    If you're interested in a lot of rowwise operations, you should probably consider a matrix. You could do `DT[ , mycol := .SD[[select]], by=select]`, though. – Frank Sep 17 '15 at 03:44
  • I suppose I could do DT[,mycol:=.SD[[select]],by=select][,mycol]. I'm dealing with much larger tables (maybe 10k X 100k) and speed is a priority, so I'm not sure if matrix operations lose performance there. – pat Sep 17 '15 at 03:52
  • To be honest, I'm not sure either. For the matrix approach, I meant what Matthew has posted below (but actually using a matrix, not a data.frame). By the way, selecting mycol at the end would be most efficient like `$mycol` instead of `[, mycol]` since `[]` has some overhead. Also, you can tag code in comments using backticks (on the tilde key in US keyboards). – Frank Sep 17 '15 at 03:53
  • `DT[,.SD[[select]],by=select]$V1` works well and might save time by not requiring the column assignment. I only need the filtered vector. – pat Sep 17 '15 at 04:03
  • That works in this case, but the `by` will group things together, so if your `select` is `c(1,2,2,1,3)` your result will be grouped like `c(1,1,2,2,3)` instead of the original order.. not sure if I'm being clear. – Frank Sep 17 '15 at 04:08
  • hmmm. order is important, as it is also an indexed time series. Does assigning the columns avoid that grouping? It should be in the same original indexed order. – pat Sep 17 '15 at 04:13
  • Yeah, assigning to a column avoids that. Another (hard to read) alternative that preserves ordering is `ave( seq(nrow(DT)), select, FUN = function(ii) DT[ii][[ select[ii][1] ]] )` – Frank Sep 17 '15 at 04:16
  • why is this a `data.table` instead of a matrix? – eddi Sep 17 '15 at 15:01
  • Here's a very similar question, maybe a dupe: http://stackoverflow.com/q/33310179/1191259 – Frank Nov 06 '15 at 19:37

2 Answers2

3

You can do this with a matrix using matrix indexing:

x <- as.matrix(DT)
x[cbind(1:nrow(x), select)]
## [1] 4 8 3

If you started out with a data frame, you can also index that with a matrix:

x <- data.frame(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9)) # or as.data.frame(DT)
x[cbind(1:nrow(x), select)]
## [1] 4 8 3
Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • 1
    if you're going to convert to `data.frame`, might as well do it in place using `setDF`, but I'm unconvinced that `data.*` is the right data structure to use here - I'd use regular matrices – eddi Sep 17 '15 at 15:05
1

A couple more options:

# extended example
DT       <- rbind(DT,DT)
select   <- c(select,rev(select))
expected <- c(4,8,3,1,8,6)

# create a new column with by
DT[, V1 := .SD[[select]], by = select]$V1

# or use ave
ave( seq(nrow(DT)), select, FUN = function(ii) DT[[ select[ii][1] ]][ii] )    

These are both basically doing the same thing: for each value v in select, grab the corresponding vector, DT[[v]]; and subset it to where select==v.

Frank
  • 66,179
  • 8
  • 96
  • 180