Taking a data.table slice with a sequence of (row,col) indices

Question

I have a data.table that resembles the one below.

library(data.table)
tab <- data.table(a = c(NA, 42190, NA), b = c(42190, 42190, NA), c = c(40570, 42190, NA))
tab
       a     b     c
1:    NA 42190 40570
2: 42190 42190 42190
3:    NA    NA    NA

Given a vector of row indices and a vector of column indices, I would like to get back a vector containing the elements of tab at the corresponding (row, column) pairs.

For example, suppose I wanted to get the diagonal elements in tab. I would specify two vectors,

ri <- 1:3
ci <- 1:3

and some function, function(ri, ci, tab), would return the diagonal elements of tab.

If tab were a data.frame, I would do what's below,

as.data.frame(tab)[cbind(ri, ci)]

but I would like to avoid data.frame syntax. I would also like to avoid a for loop, as this tends to be slow.

Here are similar questions for looking up a matrix with a vector of row,col-indices: [Subsetting a matrix using a vector of indices](https://stackoverflow.com/questions/37921589/subsetting-a-matrix-using-a-vector-of-indices), [Use indices in vector to extract elements from matrix](https://stackoverflow.com/questions/33828165/use-indices-in-vector-to-extract-elements-from-matrix) — smci, Jun 01 '18 at 02:06
Thanks @chinsoon12. Ideally, the solution would avoid transformation, and would be a direct index. — Joshua Daly, Jun 01 '18 at 02:08
If you are dealing with all numeric data you're better off working with a `matrix` from the start. The indexing method you show on a matrix is very fast. — thelatemail, Jun 01 '18 at 02:08
agree with @thelatemail that for the size of the data and number of subsetting calls, it is probably faster to use matrix. but i still believe that there will be a cross over point when data.table will be faster. — chinsoon12, Jun 01 '18 at 02:21
@chinsoon12: for sure; but if matrix suits the OP's task, they could make it the primary data-structure. If anyone has the energy to plot it, I'm curious what the graph looks like of DT vs matrix vs Matrix, for indexing an nxn with n arbitrary (row,col) indices. — smci, Jun 01 '18 at 02:26
I would have titled this *"Taking a DT/matrix slice with a sequence of (row,col) indices"* — smci, Jun 03 '18 at 09:56

smci · Answer 1 · 2019-03-05T02:13:42.223

4

(UPDATE: @42-'s answer using [.data.frame is best. But here's my previous answer)

as.matrix(tab)[cbind(ri, ci)]

is going to be faster and more memory-efficient than melt.

I see no reason not to declare your data as a matrix from the start, as @thelatemail recommends. This is one case where DT syntax is not as powerful as matrix indexing.
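As a minimal sketch of the function(ri, ci, tab) the question asks for, the matrix approach can be wrapped as below. The helper name `lookup_cells` is hypothetical (it is not part of data.table), and the sketch assumes the columns share a type; `as.matrix()` coerces mixed columns to a common type, so a table with a character column would return character values.

```r
library(data.table)

tab <- data.table(a = c(NA, 42190, NA),
                  b = c(42190, 42190, NA),
                  c = c(40570, 42190, NA))

# Hypothetical helper: pairs ri[k] with ci[k] and returns those cells.
lookup_cells <- function(ri, ci, tab) {
  stopifnot(length(ri) == length(ci))
  m <- as.matrix(tab)   # one coercion; columns are coerced to a common type
  m[cbind(ri, ci)]      # a two-column index matrix selects (row, col) pairs
}

diag_vals <- lookup_cells(1:3, 1:3, tab)          # the diagonal of tab
offdiag   <- lookup_cells(c(1, 2), c(3, 1), tab)  # cells (1, c) and (2, a)
```

If the lookup runs many times against the same table, converting once and reusing the matrix avoids repeating the coercion on every call.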

(For memory efficiency with large DTs, data.table has the commands setDF/setDT for converting to/from DF/DT without copying, but I'm not aware of an equivalent for matrix. If that is something people do a lot, it might make a good enhancement request for DT.

For really big dimensions, you might look into the Matrix package's sparse-matrix formats, or chunk your data, or use disk-backed data structures.)

edited Mar 05 '19 at 02:13

answered Jun 01 '18 at 02:08

smci


Absolutely. matrix-indexing is one of the most overlooked functions of R imho. – thelatemail Jun 01 '18 at 02:09
Thanks @smci. `tab` is derived from a subset of a `data.table`, which is why I don't use a matrix to start. Is it quick going from a `data.table` to a matrix? – Joshua Daly Jun 01 '18 at 02:11
@JoshuaDaly: `as.matrix()` should be pretty fast, how large are your dimensions? If things get really big, you might look into Matrix's sparse-matrix formats, or chunk your data, or use disk-backed data structures. – smci Jun 01 '18 at 02:13
Thanks @smci, fortunately it's not too big (8 x 4 `data.table`), but, I will be performing the operation many times, so, I would like to avoid as much overhead as possible. But, your answer is excellent, and covers what to do when the data gets bigger, and I'll mark it as answered. – Joshua Daly Jun 01 '18 at 02:16
Cool. Take a look at [`Matrix`](https://www.rdocumentation.org/packages/Matrix) package – smci Jun 01 '18 at 02:19
Coercion is an unnecessary step. Leave it as a data.table (which inherits from data.frame) and use the function that is designed for data.frames. – IRTFM Mar 05 '19 at 01:06
@42- Awesome, didn't know of [`[.data.frame`](https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/Extract.data.frame) until now. – smci Mar 05 '19 at 02:12

score 4 · Accepted Answer · answered Mar 05 '19 at 01:02

There is a faster way to do this than coercing to either matrix or data.frame. Just use the [.data.frame function.

`[.data.frame`( tab,  cbind(ri,ci) )
[1]    NA 42190    NA

This is the functional syntax for the [.data.frame function.
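A small runnable check of the call above, using the question's `tab`, `ri`, and `ci`. The names `idx`, `via_df_method`, and `via_matrix` are illustrative only; the sketch simply compares the result with the explicit matrix coercion from the other answer.

```r
library(data.table)

tab <- data.table(a = c(NA, 42190, NA),
                  b = c(42190, 42190, NA),
                  c = c(40570, 42190, NA))
ri <- 1:3
ci <- 1:3
idx <- cbind(ri, ci)   # two-column matrix: one (row, col) pair per row

# Call the data.frame method by name, bypassing data.table's own `[` method.
via_df_method <- `[.data.frame`(tab, idx)

# The explicit coercion from the other answer, for comparison.
via_matrix <- as.matrix(tab)[idx]

same_result <- isTRUE(all.equal(via_df_method, via_matrix))
```

Both routes return one value per (row, col) pair, so either can sit behind the same helper.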

answered Mar 05 '19 at 01:02

IRTFM


Awesome. Can you explain in your answer why `[.data.frame` works natively on a data.table, and link to [its doc](https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/Extract.data.frame)? – smci Mar 05 '19 at 02:11
R functions are dispatched by the class of their first argument. Run `class(DT)` and you get `[1] "data.table" "data.frame"`. The "[" function is generic and has a data.frame method, i.e. the `[.data.frame` function, so when that method is called by name the interpreter passes the data.table object to the code of that function rather than to `[.data.table`. Look at the output from `methods("[")`. – IRTFM Mar 05 '19 at 06:04
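The class relationship described in this comment can be inspected directly. This is only a sketch: the small table `DT` and the variable names are illustrative, not from the thread.

```r
library(data.table)

DT <- data.table(a = 1:2, b = 3:4)

cls   <- class(DT)                    # class vector, most specific class first
is_df <- inherits(DT, "data.frame")   # a data.table is also a data.frame

# Both S3 methods for `[` are available once data.table is loaded.
dt_method <- getS3method("[", "data.table")
df_method <- getS3method("[", "data.frame")
both_are_functions <- is.function(dt_method) && is.function(df_method)
```

Because "data.table" comes first in the class vector, an ordinary `DT[...]` call reaches data.table's method; naming `[.data.frame` explicitly is what routes the object to the data.frame code instead.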
