I have a data.table that resembles the one below.

tab <- data.table(a = c(NA, 42190, NA), b = c(42190, 42190, NA), c = c(40570, 42190, NA))
tab
       a     b     c
1:    NA 42190 40570
2: 42190 42190 42190
3:    NA    NA    NA

Upon specification of a vector of row indices, and a vector of column indices, I would like a vector returned containing the points in tab corresponding to the specified vector of row indices and column indices.

For example, suppose I wanted to get the diagonal elements in tab. I would specify two vectors,

ri <- 1:3
ci <- 1:3

and some function, function(ri, ci, tab), would return the diagonal elements of tab.

If tab were a data.frame, I would do what's below,

as.data.frame(tab)[cbind(ri, ci)]

but, I would like to avoid data.frame syntax. I would also like to avoid a for loop, as this tends to be slow.

Joshua Daly
  • maybe melt into a long format and set key? – chinsoon12 Jun 01 '18 at 02:04
  • Here are similar questions for looking up a matrix with a vector of row,col-indices: [Subsetting a matrix using a vector of indices](https://stackoverflow.com/questions/37921589/subsetting-a-matrix-using-a-vector-of-indices), [Use indices in vector to extract elements from matrix](https://stackoverflow.com/questions/33828165/use-indices-in-vector-to-extract-elements-from-matrix) – smci Jun 01 '18 at 02:06
  • Thanks @chinsoon12. Ideally, the solution would avoid transformation, and would be a direct index. – Joshua Daly Jun 01 '18 at 02:08
  • If you are dealing with all numeric data you're better off working with a `matrix` from the start. The indexing method you show on a matrix is very fast. – thelatemail Jun 01 '18 at 02:08
  • agree with @thelatemail that for the size of the data and number of subsetting calls, it is probably faster to use matrix. but i still believe that there will be a cross over point when data.table will be faster. – chinsoon12 Jun 01 '18 at 02:21
  • @chinsoon12: for sure; but if matrix suits the OP's task, they could make it the primary data-structure. If anyone has the energy to plot it, I'm curious what the graph looks like of DT vs matrix vs Matrix, for indexing an nxn with n arbitrary (row,col) indices. – smci Jun 01 '18 at 02:26
  • I would have titled this *"Taking a DT/matrix slice with a sequence of (row,col) indices"* – smci Jun 03 '18 at 09:56
  • Thanks @smci, I'll edit that. – Joshua Daly Jun 03 '18 at 10:09

2 Answers

(UPDATE: @42-'s answer using [.data.frame is best. But here's my previous answer)

as.matrix(tab)[cbind(ri, ci)]

is going to be faster and more memory-efficient than melt.

I see no reason not to declare your DT as a matrix, as @thelatemail recommends. This is one case where DT syntax is not as powerful as matrix syntax.
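As a rough illustration of the matrix indexing recommended above, here is a minimal base R sketch. It assumes the data already lives in a numeric matrix (the values are copied from the question's `tab`); the names `m`, `idx`, `diag_vals` and `off_vals` are made up for this example.

```r
# Base R only; the values are copied from the question's `tab`.
m <- cbind(a = c(NA, 42190, NA),
           b = c(42190, 42190, NA),
           c = c(40570, 42190, NA))

ri <- 1:3
ci <- 1:3

# A two-column matrix used as an index: each row is one (row, column) pair.
idx <- cbind(ri, ci)
diag_vals <- m[idx]

# The pairs do not have to follow any pattern, e.g. cells (1,2), (1,3), (2,1).
off_vals <- m[cbind(c(1, 1, 2), c(2, 3, 1))]
```

Here `diag_vals` holds the three diagonal cells and `off_vals` the three arbitrary cells, each as a plain numeric vector with one element per index pair.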

(For memory-efficiency with large DTs, data.table has the commands setDF/setDT to convert to/from DF/DT without copying, but I'm not aware of an equivalent for matrix. If that is something people do a lot, it might make a good enhancement request for DT.

For really big dimensions, you might look into the Matrix package's sparse-matrix formats, or chunk your data, or use disk-backed data structures.)

smci
  • Absolutely. matrix-indexing is one of the most overlooked functions of R imho. – thelatemail Jun 01 '18 at 02:09
  • Thanks @smci. `tab` is derived from a subset of a `data.table`, which is why I don't use a matrix to start. Is it quick going from a `data.table` to a matrix? – Joshua Daly Jun 01 '18 at 02:11
  • @JoshuaDaly: `as.matrix()` should be pretty fast, how large are your dimensions? If things get really big, you might look into Matrix's sparse-matrix formats, or chunk your data, or use disk-backed data structures. – smci Jun 01 '18 at 02:13
  • Thanks @smci, fortunately it's not too big (8 x 4 `data.table`), but, I will be performing the operation many times, so, I would like to avoid as much overhead as possible. But, your answer is excellent, and covers what to do when the data gets bigger, and I'll mark it as answered. – Joshua Daly Jun 01 '18 at 02:16
  • Cool. Take a look at [`Matrix`](https://www.rdocumentation.org/packages/Matrix) package – smci Jun 01 '18 at 02:19
  • Coercion is an unnecessary step. Leave it as a data.table (which inherits from data.frame) and use the function that is designed for data.frames. – IRTFM Mar 05 '19 at 01:06
  • @42- Awesome, didn't know of [`[.data.frame`](https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/Extract.data.frame) until now. – smci Mar 05 '19 at 02:12

There is a faster way to do this than coercing to either matrix or data.frame. Just use the `[.data.frame` function.

`[.data.frame`( tab,  cbind(ri,ci) )
[1]    NA 42190    NA

This is the functional syntax for the `[.data.frame` function.
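To try the functional call without data.table installed, here is a small base R sketch. It assumes a data.frame tagged with an extra leading class stands in for a data.table; the class name `my_table` and the variables `is_df` and `picked` are hypothetical and exist only for this illustration.

```r
# Base R only. "my_table" is a made-up class name standing in for "data.table".
df <- data.frame(a = c(NA, 42190, NA),
                 b = c(42190, 42190, NA),
                 c = c(40570, 42190, NA))
class(df) <- c("my_table", "data.frame")

ri <- 1:3
ci <- 1:3

# The object still inherits from data.frame despite the extra leading class.
is_df <- inherits(df, "data.frame")

# Call the data.frame method by name instead of relying on S3 dispatch
# from the first class in the class vector.
picked <- `[.data.frame`(df, cbind(ri, ci))
```

Naming the method explicitly is what lets the data.frame code run on an object whose first class is something else.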

IRTFM
  • Awesome. Can you explain in your answer why `[.data.frame` works natively on a data.table, and link to [its doc](https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/Extract.data.frame)? – smci Mar 05 '19 at 02:11
  • R functions are dispatched by the class of their first argument. Run `class(DT)` and you get `[1] "data.table" "data.frame"`. The "[" function is generic and has a data.frame method, i.e. the `[.data.frame` function. Calling that method by name hands the data.table object straight to its code rather than to `[.data.table`. Look at the output from ``methods(`[`)`` – IRTFM Mar 05 '19 at 06:04