2

I'm stuck on a problem and can't find a satisfying answer on the web. I would like to set values in a data.frame (a data.table would also be fine) using vectors of start and end column indexes. An example will clarify what I'm asking.

Suppose I have a data.frame like the following:

df <- data.frame(col_1 = rep(0, 3), col_2 = rep(0, 3), col_3 = rep(0, 3), col_4 = rep(0,3))
df
  col_1 col_2 col_3 col_4
1     0     0     0     0
2     0     0     0     0
3     0     0     0     0

And suppose I have two vectors:

indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

I would like to set to 1 all values in the column range given by the vectors, row by row. The output should be the following:

  col_1 col_2 col_3 col_4
1     1     1     0     0
2     0     1     1     1
3     1     1     1     0

I tried something like this:

df[ , indexesStart:indexesEnd] <- 1

But it doesn't work: it just takes indexesStart[1]:indexesEnd[1] and applies it to all rows.

I must avoid explicit loops because my real data frame has millions of rows and looping is too slow. Any help is appreciated (a data.table solution would be even better).
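To make the failure concrete: the colon operator only uses the first element of each operand, so the attempt above amounts to the assignment below. This is a minimal sketch that reuses df, indexesStart and indexesEnd from the question; firstRange and attempt are names introduced here for illustration.

```r
df <- data.frame(col_1 = rep(0, 3), col_2 = rep(0, 3), col_3 = rep(0, 3), col_4 = rep(0, 3))
indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

# `:` ignores everything after the first element of each operand,
# so the attempted assignment is effectively this one
firstRange <- indexesStart[1]:indexesEnd[1]   # 1 2
attempt <- df
attempt[ , firstRange] <- 1

# Columns 1 and 2 are set to 1 in every row; columns 3 and 4 stay 0
attempt
```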

Thank you

Henrik
  • I think that cannot be done without a loop (in one form or another), because every row has a different set of values to change. – jogo Nov 21 '18 at 10:57

2 Answers

2

This will do it:

df <- data.frame(col_1=rep(0,3),col_2=rep(0,3),col_3=rep(0,3),col_4=rep(0,3))
indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

for (i in seq_len(nrow(df))) df[i, indexesStart[i]:indexesEnd[i]] <- 1

df

Here is another technique using a two-column matrix as index:

I <- do.call(rbind, lapply(seq_along(indexesStart), function(i) cbind(i, indexesStart[i]:indexesEnd[i])))
df[I] <- 1

In the second variant I hid the loop: it now runs inside lapply().
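The same two-column index matrix can also be built without lapply, using rep() and sequence(). This is a sketch rather than part of the answer above: the names len, rows and cols are introduced here for illustration, and it assumes every start index is less than or equal to its end index.

```r
df <- data.frame(col_1 = rep(0, 3), col_2 = rep(0, 3), col_3 = rep(0, 3), col_4 = rep(0, 3))
indexesStart <- c(1, 2, 1)
indexesEnd   <- c(2, 4, 3)

# Number of cells to fill in each row (assumes start <= end everywhere)
len <- indexesEnd - indexesStart + 1

# Row index, repeated once per cell of that row's range
rows <- rep(seq_along(indexesStart), len)

# Column index: 1..len within each row, shifted so it begins at the row's start
cols <- sequence(len) + rep(indexesStart, len) - 1

# Two-column (row, column) matrix used as an index into the data.frame
I <- cbind(rows, cols)
df[I] <- 1
df
```

For very large inputs it may be worth doing the assignment on a numeric matrix instead of a data.frame, though that choice is not benchmarked here.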

jogo
  • Thanks, but I need to avoid for loops – Andrea Rivolta Nov 21 '18 at 10:51
  • In my second variant I hid the loop (it now runs inside lapply()). – jogo Nov 21 '18 at 11:51
  • The second variant works very fast (less than a second). I had to make some adjustments because (I left this out of the post to keep it simple) each row has multiple column ranges (length(indexes) > dim(df)[1]), but I was able to handle this through index mapping. Now, if I View (or print) df[I] everything is OK, but when I do the assignment it says: Error in `[<-.data.frame`(`*tmp*`, I, value = 1) : 'value' is the wrong length – Andrea Rivolta Nov 21 '18 at 13:17
  • OK, I solved it: the mapping introduced repeated coordinates. Using unique(I) everything works fine. From 1600 to 2 seconds. Thank you very much – Andrea Rivolta Nov 21 '18 at 13:30
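The duplicate-coordinate situation described in the comments can be illustrated with a small sketch. The vectors rowIdx, rangeStart and rangeEnd are hypothetical stand-ins for the multi-range mapping, which the comments do not show; here row 1 has two overlapping ranges, so the cell (1, 2) appears twice in the index matrix.

```r
df <- data.frame(col_1 = rep(0, 3), col_2 = rep(0, 3), col_3 = rep(0, 3), col_4 = rep(0, 3))

# Hypothetical mapping: several (possibly overlapping) ranges per row
rowIdx     <- c(1, 1, 2)
rangeStart <- c(1, 2, 2)
rangeEnd   <- c(2, 3, 4)

len  <- rangeEnd - rangeStart + 1
rows <- rep(rowIdx, len)
cols <- sequence(len) + rep(rangeStart, len) - 1
I <- cbind(rows, cols)

# Overlapping ranges produce repeated (row, column) pairs
anyDuplicated(I) > 0

# Dropping the repeated rows of I before assigning, as the comment suggests
df[unique(I)] <- 1
df
```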
0

Try this: it avoids an explicit loop or lapply by using Vectorize(). It takes advantage of the fact that a data.frame is really a list.

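# Set elements start..end of one row vector to 1
impute <- function(lst, start, end){ lst[start:end] <- 1; lst }

fill <- function(df, start, end){
  cols <- names(df)
  # Transpose so that each original row becomes one list element
  lst <- as.list(as.data.frame(t(df)))
  # Apply impute() element-wise over rows, starts and ends, then transpose back
  res <- as.data.frame(t(Vectorize(impute)(lst, start, end)))
  names(res) <- cols
  row.names(res) <- row.names(df)
  res
}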

res <- fill(df, indexesStart, indexesEnd)

Takes around 5 seconds to do 1 million rows on my MacBook Pro.
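For reference, Vectorize() is a thin wrapper around mapply(), so the call inside fill() still iterates over the rows internally rather than eliminating the loop. A minimal sketch with a hand-built list of rows; rowsList, starts and ends are illustrative names, not part of the answer:

```r
impute <- function(lst, start, end) { lst[start:end] <- 1; lst }

# Three "rows" of four zeros each, plus per-row start and end positions
rowsList <- list(rep(0, 4), rep(0, 4), rep(0, 4))
starts <- c(1, 2, 1)
ends   <- c(2, 4, 3)

viaVectorize <- Vectorize(impute)(rowsList, starts, ends)
viaMapply    <- mapply(impute, rowsList, starts, ends)

# Both calls return a 4 x 3 matrix with one column per original row
isTRUE(all.equal(viaVectorize, viaMapply))

# Transposing gives one row per original row
t(viaMapply)
```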

rookie