Choose one cell per row in data frame

Question

I have a vector that tells me, for each row in a data frame, the column index for which the value in this row should be updated.

> set.seed(12008); n <- 10000; d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
> i <- sample.int(3, n, replace=TRUE)
> head(d); head(i)
  c1 c2 c3
1  1  2  3
2  2  4  6
3  3  6  9
4  4  8 12
5  5 10 15
6  6 12 18
[1] 3 2 2 3 2 1

This means that for rows 1 and 4, c3 should be updated; for rows 2, 3 and 5, c2 should be updated; and so on. What is the cleanest way to achieve this in R using vectorized operations, i.e., without apply and friends? EDIT: And, if at all possible, without R loops?

I have thought about transforming d into a matrix and then addressing the matrix elements with a one-dimensional index vector. But I haven't found a clean way to compute the one-dimensional address from the row and column indexes.

score 6 · Answer 1 · answered Jun 05 '12 at 10:47

With your example data, and using only the first few rows (D and I below), you can easily do what you want via a matrix, as you surmise.

set.seed(12008)
n <- 10000
d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- sample.int(3, n, replace=TRUE)
## just work with small subset
D <- head(d)
I <- head(i)

First, convert D into a matrix:

dmat <- data.matrix(D)

Next, compute the positions, in the vector representation of the matrix, of the cells identified by each row and the column given in I. The column indices are I itself, and the row indices are easy to generate with seq_along(I), which in this simple example is the vector 1:6. To compute the vector indices we can use:

(I - 1) * nrow(D) + seq_along(I)

where the first part ( (I - 1) * nrow(D) ) is the number of elements stored before column I begins, a multiple of the number of rows (6 here). Adding the row index then gives the position of that row's element within column I.
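The arithmetic above can be wrapped in a small helper. This is only a sketch: the name `matrix_index` and its arguments are made up for illustration (it is not a base R function), and it assumes `rows` and `cols` are valid indices of equal length. The column indices are typed in literally, copied from the `head(i)` output in the question, so the result does not depend on the random number generator.

```r
# Hypothetical helper (not part of base R): convert row/column index pairs
# into positions in the column-major vector that underlies a matrix.
matrix_index <- function(m, rows, cols) {
  stopifnot(is.matrix(m), length(rows) == length(cols))
  (cols - 1) * nrow(m) + rows
}

m    <- matrix(1:18, nrow = 6)   # 6 x 3 matrix, filled column by column
rows <- 1:6                      # one entry per row
cols <- c(3, 2, 2, 3, 2, 1)      # column chosen for each row

idx <- matrix_index(m, rows, cols)
idx
# [1] 13  8  9 16 11  6

# The linear positions pick out the same cells as (row, column) pairs
same_cells <- identical(m[idx], m[cbind(rows, cols)])
same_cells
```

Because `m` holds the values 1:18 in storage order, each selected value equals its own linear position, which makes the mapping easy to check by eye.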

Using this we just index into dmat using "[", treating it like a vector. The replacement version of "[" ("[<-") allows us to do the replacement in a single line. Here I replace the indicated elements with NA to make it easier to see that the correct elements were identified:

> dmat
  c1 c2 c3
1  1  2  3
2  2  4  6
3  3  6  9
4  4  8 12
5  5 10 15
6  6 12 18
> dmat[(I - 1) * nrow(D) + seq_along(I)] <- NA
> dmat
  c1 c2 c3
1  1  2 NA
2  2 NA  6
3  3 NA  9
4  4  8 NA
5  5 NA 15
6 NA 12 18
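If the result is needed as a data frame again, the matrix can be converted back afterwards. The following is a minimal sketch, assuming all columns are numeric; the names `D2` and `na_pos` are introduced here for illustration only. One caveat worth knowing: because `c2` and `c3` are doubles, `data.matrix()` returns a double matrix, so the integer column `c1` comes back as double after the round trip.

```r
D <- data.frame(c1 = 1:6, c2 = 2 * (1:6), c3 = 3 * (1:6))
I <- c(3, 2, 2, 3, 2, 1)   # column to update in each row (from the question)

dmat <- data.matrix(D)
dmat[(I - 1) * nrow(D) + seq_along(I)] <- NA

# Linear positions that were replaced, in storage order
na_pos <- unname(which(is.na(dmat)))

# Back to a data frame; every column is now of type double
D2 <- as.data.frame(dmat)
str(D2)
```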

answered Jun 05 '12 at 10:47

Gavin Simpson

Thank you. But is this construct `(I - 1) * nrow(D) + seq_along(I)` encapsulated in some function that is publicly accessible? (More generally, I'm looking for something like `matrix.index(m, r, c)` where `r` is the row vector and `c` is the column vector. I know how to build it, but this must be in R core somewhere, no?) How does matrix addressing work internally? – krlmlr Jun 05 '12 at 11:02
No, it is not. `I` is the column (`c` in your notation), `seq_along(I)` is the row (or `r`). I used the things I did because of your example, though `i` is a vector as long as the number of rows according to your example so my code still works even for big `i`. For the last bit, study the C code or the R Internals documentation; it is all done in C, but note that as far as R is concerned, a matrix is just a vector with elements stacked columnwise, i.e. columns are filled first so when treating a matrix as a vector, all the rows of col 1 come first, then the rows of column 2 etc. – Gavin Simpson Jun 05 '12 at 11:24
@user946850 That said, there is nothing stopping you writing a `matrixIndex()` using the example shown above. You can put that in your own private package and load it (or arrange for it to be loaded automagically) at the start of each R session. – Gavin Simpson Jun 05 '12 at 11:25
@user946850 Why? This is a simple operation so easy to cook up. The R Core Development team have taken the stance that they want the core to be reasonably lean, so as to minimise the maintenance burden, or to include things that they want/need for their research etc. Code/functions in packages are first class R citizens, so there is no need to have everything in base R. Note that more complex indexing jobs (select the upper or lower triangles of a matrix) *are* included in base R, but not, as far as I know, the simple case you mention. It may well be in another package on CRAN? – Gavin Simpson Jun 05 '12 at 13:43
Consistency. The way multi-dimensional indexes map to the storage is a design parameter, tightly coupled with the R core. What if this is ever changed? -- Other than that, indeed I agree that this can be in a separate package as well. – krlmlr Jun 05 '12 at 14:00
@user946850 if you want consistency, look elsewhere than R ;-) R is stable, so don't expect the external way matrices are indexed to change. Internally the code to do it will change but the code in my example will always work. – Gavin Simpson Jun 05 '12 at 14:10
@user946850 & Gavin (+1) -- Looks like the R-core folks do see some utility in this, and have added it (well, its equivalent functionality) to the current R-devel. (More details in an answer I just added below). – Josh O'Brien Jun 07 '12 at 16:42
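
As a small illustration of the column-major layout described in the comments above, the sketch below checks the storage order directly. It also uses base R's `arrayInd()`, which goes in the opposite direction to the formula in the answer: from a linear position back to row and column indices. The variable names are illustrative.

```r
m <- matrix(1:18, nrow = 6)   # 6 x 3 matrix

# Dropping the dim attribute exposes the underlying vector:
# all of column 1 first, then column 2, then column 3.
v <- as.vector(m)
storage_matches <- identical(v, c(m[, 1], m[, 2], m[, 3]))

# Linear position 13 corresponds to row 1 of column 3
rc <- arrayInd(13, dim(m))
rc
same_cell <- m[13] == m[1, 3]
```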

Josh O'Brien · Accepted Answer · 2012-06-10T17:46:57.320

4

If you are willing to first convert your data.frame to a matrix, you can index elements-to-be-replaced using a two-column matrix. (Beginning with R-2.16.0, this will be possible with data.frames directly.) The indexing matrix should have row indices in its first column and column indices in its second column.

Here's an example:

## Create a subset of your data
set.seed(12008); n  <- 6 
D  <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- seq_len(nrow(D))            # vector of row indices
j <- sample(3, n, replace=TRUE)  # vector of column indices 
ij <- cbind(i, j)                # a 2-column matrix to index a 2-D array 
                                 # (This extends smoothly to higher-D arrays.)  

## Convert it to a matrix    
Dmat <- as.matrix(D)

## Replace the elements indexed by 'ij'
Dmat[ij] <- NA
Dmat
#      c1 c2 c3
# [1,]  1  2 NA
# [2,]  2 NA  6
# [3,]  3 NA  9
# [4,]  4  8 NA
# [5,]  5 NA 15
# [6,] NA 12 18
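
The same index matrix works for extraction as well as replacement, so the selected cells can be updated from their current values rather than overwritten with a constant. Here is a minimal sketch of that idea; negating the values is an arbitrary update chosen for illustration, and the column indices are typed in literally (they are the values shown in the question) so the example does not depend on `sample()`.

```r
D  <- data.frame(c1 = 1:6, c2 = 2 * (1:6), c3 = 3 * (1:6))
j  <- c(3, 2, 2, 3, 2, 1)            # column to update in each row
ij <- cbind(seq_len(nrow(D)), j)     # (row, column) pairs

Dmat <- as.matrix(D)

before <- Dmat[ij]                   # current values of the selected cells
before
# [1]  3  4  6 12 10  6

Dmat[ij] <- -Dmat[ij]                # negate exactly one cell per row
Dmat
```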

Beginning with R-2.16.0 (the development version at the time of writing, which was eventually released as R 3.0.0), you will be able to use the same syntax for data frames, i.e. without first having to convert them to matrices.

From the R-devel NEWS file:

Matrix indexing of dataframes by two column numeric indices is now supported for replacement as well as extraction.

Using the current R-devel snapshot, here's what that looks like:

D[ij] <- NA
D
#   c1 c2 c3
# 1  1  2 NA
# 2  2 NA  6
# 3  3 NA  9
# 4  4  8 NA
# 5  5 NA 15
# 6 NA 12 18

edited Jun 10 '12 at 17:46

answered Jun 07 '12 at 16:36

Josh O'Brien

Has that been ported to the 2.15.1 branch? R-devel would usually mean the next minor version, i.e. 2.16.x. – Gavin Simpson Jun 07 '12 at 16:54
@GavinSimpson -- Nice catch. Thanks. Looking again, I now see the prominent note that the "r59537 development snapshot of R [...] will eventually become R-2.16.0". Will edit my post accordingly. – Josh O'Brien Jun 07 '12 at 17:03
Will there be analogous support for matrices/arrays, too? – krlmlr Jun 07 '12 at 17:10
@user946850 -- There already is! Try this, to see how it works with arrays: `a <- array(1:64, dim=c(4,4,4)); a[cbind(1:4, c(4,3,1,2), 1:4)] <- NA`. – Josh O'Brien Jun 07 '12 at 17:15
@JoshO'Brien: Would you mind adding this as an answer in its own right, or perhaps edit this answer? This is much cleaner than what Gavin proposed, indeed. (Is there a reference to this kind of addressing in the docs?) – krlmlr Jun 07 '12 at 17:32
@user946850 -- Good idea. I've edited this post to demo both the currently available and the soon-to-be available options. – Josh O'Brien Jun 07 '12 at 19:09

score 3 · Answer 3 · answered Jun 05 '12 at 09:34

Here's one way:

d[which(i == 1), "c1"] <- "one"
d[which(i == 2), "c2"] <- "two"
d[which(i == 3), "c3"] <- "three"

   c1  c2    c3
1   1   2 three
2   2 two     6
3   3 two     9
4   4   8 three
5   5 two    15
6 one  12    18
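
The three assignments above follow one pattern, so they can be written as a loop over the columns. This is a sketch rather than a fully vectorized solution: it still loops in R, but only once per column, and each assignment is vectorized over the rows. `NA` is used as the replacement value so that the columns keep their numeric type.

```r
d <- data.frame(c1 = 1:6, c2 = 2 * (1:6), c3 = 3 * (1:6))
i <- c(3, 2, 2, 3, 2, 1)   # column to update in each row (from the question)

# One pass per column; rows are selected with a logical vector
for (k in seq_along(d)) {
  d[i == k, k] <- NA
}
d
```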

answered Jun 05 '12 at 09:34

Roman Luštrik

Thank you. This requires a loop over the columns, which isn't too bad. Still, is there a fully vectorized solution? – krlmlr Jun 05 '12 at 09:46
