Access data.table columns through vector indexes?

Question

I'm stuck on a problem and can find no satisfying answer on the web. I would like to fill a data.frame (a data.table would also be fine) using start:end vectors. An example will clarify what I'm asking.

Suppose I have a data.frame like the following:

df <- data.frame(col_1 = rep(0, 3), col_2 = rep(0, 3), col_3 = rep(0, 3), col_4 = rep(0,3))
df
  col_1 col_2 col_3 col_4
1     0     0     0     0
2     0     0     0     0
3     0     0     0     0

And suppose I have two vectors:

indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

For each row, I would like to set to 1 all values in the column range given by the two vectors. The output should be the following:

  col_1 col_2 col_3 col_4
1     1     1     0     0
2     0     1     1     1
3     1     1     1     0

I tried something like this:

df[ , indexesStart:indexesEnd] <- 1

But it doesn't work: it just takes indexesStart[1]:indexesEnd[1] and repeats it for all rows.

I must avoid loops because my real data frame has millions of rows and looping is too slow. Any help is appreciated (a data.table solution would be even better).
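The behaviour described above comes from the `:` operator itself, which is not vectorized over its operands. The following is a minimal sketch using the vectors from the question; the names `cols` and `ranges` are introduced here for illustration only.

```r
indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

# `:` uses only the first element of each operand
# (R signals a warning here, suppressed for this demo).
cols <- suppressWarnings(indexesStart:indexesEnd)
print(cols)

# Getting one range per row needs an element-wise construct such as Map()
ranges <- Map(seq, indexesStart, indexesEnd)
print(ranges)
```

So `df[, indexesStart:indexesEnd]` selects the same columns for every row, while `ranges` holds a separate column range for each row.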

Thank you

I think that cannot be done without a loop (in one form or another), because every row has a different set of values to change. – jogo, Nov 21 '18 at 10:57

jogo · Accepted Answer · 2018-11-21T12:37:47.583

2

This will do it:

df <- data.frame(col_1=rep(0,3),col_2=rep(0,3),col_3=rep(0,3),col_4=rep(0,3))
indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

# For each row i, set columns indexesStart[i]..indexesEnd[i] to 1
for (i in seq_len(nrow(df))) df[i, indexesStart[i]:indexesEnd[i]] <- 1

df

Here is another technique using a two-column matrix as index:

# Build a two-column matrix of (row, column) coordinates, then assign in one step
I <- do.call(rbind, lapply(seq_along(indexesStart), function(i) cbind(i, indexesStart[i]:indexesEnd[i])))
df[I] <- 1

In the second variant I hid the loop (it now runs inside lapply instead of for).
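The index matrix can also be built without `lapply`, using only `rep()` and `sequence()` from base R. This is a sketch rather than part of the original answer: it assumes one range per row and `indexesEnd >= indexesStart` everywhere, and the names `len`, `rows` and `cols` are illustrative.

```r
df <- data.frame(col_1 = rep(0, 3), col_2 = rep(0, 3),
                 col_3 = rep(0, 3), col_4 = rep(0, 3))
indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

# Number of cells to fill in each row (assumes end >= start)
len <- indexesEnd - indexesStart + 1

# Row index repeated once per cell of its range
rows <- rep(seq_along(indexesStart), len)

# sequence(len) counts 1..len[i] within each row; shift it by the row's start
cols <- sequence(len) + rep(indexesStart, len) - 1

# Two-column (row, column) matrix used as index
I <- cbind(rows, cols)
df[I] <- 1
print(df)
```

The result should match the output requested in the question, with the per-row work pushed into vectorized primitives.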

edited Nov 21 '18 at 12:37

answered Nov 21 '18 at 10:47

jogo

Thanks, but I need to avoid for loops – Andrea Rivolta Nov 21 '18 at 10:51
In my second variant I hid the loop (it now runs inside lapply instead of for). – jogo Nov 21 '18 at 11:51
The second variant is very fast (less than a second). I had to make some adjustments because, although I left this out of the post to keep it simple, each row has multiple column ranges (length(indexes) > dim(df)[1]); I was able to handle this by mapping the indexes to rows. Now, if I View (or print) df[I] everything is fine, but the assignment fails with: Error in `[<-.data.frame`(`*tmp*`, I, value = 1) : 'value' is the wrong length – Andrea Rivolta Nov 21 '18 at 13:17
OK, I solved it: the mapping introduced duplicate coordinates. With unique(I) everything works fine. The run time dropped from 1600 to 2 seconds. Thank you very much – Andrea Rivolta Nov 21 '18 at 13:30
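The multi-range case from the comments above can be sketched as follows. The data are hypothetical: `rowId`, `start` and `end` are illustrative names, with two overlapping ranges mapped to row 1, so the coordinate (1, 2) appears twice before deduplication.

```r
df <- data.frame(col_1 = rep(0, 3), col_2 = rep(0, 3),
                 col_3 = rep(0, 3), col_4 = rep(0, 3))

# Hypothetical mapping: several ranges may point to the same row
rowId <- c(1, 1, 2, 3)
start <- c(1, 2, 2, 1)
end   <- c(2, 3, 4, 3)

# One (row, column) pair per cell; overlapping ranges give repeated pairs
I <- do.call(rbind, lapply(seq_along(start),
                           function(k) cbind(rowId[k], start[k]:end[k])))

# Drop repeated coordinates before assigning, as described in the comment
I <- unique(I)
df[I] <- 1
print(df)
```

`unique()` on a matrix removes duplicated rows, so each cell is addressed exactly once in the assignment.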

rookie · Answer 2 · 2018-11-21T12:58:10.447

0

Try this; it avoids an explicit loop or lapply by using Vectorize (which calls mapply internally, so the iteration is hidden rather than removed). This takes advantage of the fact that a data.frame is really a list.

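# Set positions start..end of one vector (one row of df) to 1
impute <- function(lst, start, end){ lst[start:end] <- 1; lst }

fill <- function(df, start, end){
  # Transpose so that each list element holds one row of df
  lst <- as.list(as.data.frame(t(df)))
  # Apply impute() to each row with its own start/end, then transpose back
  res <- as.data.frame(t(Vectorize(impute)(lst, start, end)))
  names(res) <- names(df)
  row.names(res) <- row.names(df)
  res
}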

res <- fill(df, indexesStart, indexesEnd)

Takes around 5 seconds to do 1 million rows on my MacBook Pro.
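For comparison, when there is exactly one range per row, a logical mask avoids both the transposition and the per-row function calls. This is a sketch under that assumption, not something proposed in the thread; `mask` is an illustrative name.

```r
df <- data.frame(col_1 = rep(0, 3), col_2 = rep(0, 3),
                 col_3 = rep(0, 3), col_4 = rep(0, 3))
indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

# col(df) is a matrix holding each cell's column number. The start/end
# vectors have one entry per row, so they recycle down each column and
# every cell is compared with the bounds of its own row.
mask <- col(df) >= indexesStart & col(df) <= indexesEnd

# Assign through the logical matrix
df[mask] <- 1
print(df)
```

The mask has as many cells as the data frame itself, so memory use grows with rows times columns. It suits tables with few columns better than very wide ones.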

edited Nov 21 '18 at 12:58

answered Nov 21 '18 at 12:33

rookie

This solution works as well, but jogo's way is faster! Thank you anyway :) – Andrea Rivolta Nov 21 '18 at 13:32
