
I am looking for a faster way to achieve the operation below. The dataset contains >1M rows, but I have provided a simplified example to illustrate the task --

To create the data table --

dt <- data.table(name=c("john","jill"), a1=c(1,4), a2=c(2,5), a3=c(3,6), 
      b1=c(10,40), b2=c(20,50), b3=c(30,60))

colGroups <- c("a","b")   # Columns starting in "a", and in "b"

Original Dataset
-----------------------------------
name    a1   a2   a3   b1   b2   b3
john    1    2    3    10   20   30
jill    4    5    6    40   50   60

The above dataset is transformed such that 2 new rows are added for each unique name, and in each new row the values are left-shifted independently within each group of columns (in this example I have used "a" columns and "b" columns, but there are many more groups)

Transformed Dataset
-----------------------------------
name    a1   a2   a3   b1   b2   b3
john    1    2    3    10   20   30  # First Row for John
john    2    3    0    20   30    0  # "a" values left shifted, "b" values left shifted
john    3    0    0    30   0     0  # Same as above, left-shifted again

jill    4    5    6    40   50   60  # Repeated for Jill
jill    5    6    0    50   60    0 
jill    6    0    0    60    0    0

And so on. My dataset is extremely large, which is why I am trying to see if there is an efficient way to implement this.
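For clarity, the core operation on each group within a row is just a left shift with zero padding, which can be sketched in base R (`shift_left` is an illustrative helper name, not something from my actual code):

```r
# Left-shift a vector by n positions, padding the tail with zeros
shift_left <- function(x, n) {
  if (n >= length(x)) return(rep(0, length(x)))
  c(x[-seq_len(n)], rep(0, n))
}

shift_left(c(1, 2, 3), 1)     # 2 3 0
shift_left(c(10, 20, 30), 2)  # 30 0 0
```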

Thanks in advance.

xbsd
  • Is there a fixed number of columns per `columnGroup`? Or does that value vary by column group? – Ricardo Saporta Oct 03 '13 at 22:08
  • @Ricardo Saporta - Fixed number of columns per column group – xbsd Oct 03 '13 at 22:46
  • Are there only unique names/rows in your data? If not, how would you handle duplicate names.. that is, if your example data has "john" with all other columns having different values..? – Arun Oct 03 '13 at 22:51
  • @Metrics - Have implemented this in KDB/Q, basically in-database using Q. For R - I tried `apply` taking each group at a time, so separate a groups, b groups and then cbind them. However, I had some issue in subsetting the elements using `apply(dt,1,FUN=function)` ... where function would generate 2 new rows with elements left shifted. – xbsd Oct 03 '13 at 22:53
  • @Arun - sorry I should have clarified. All names are unique. All groups have fixed columns (a has 3, b 3, ...etc) – xbsd Oct 03 '13 at 22:54
  • 5
    Hey Look!! It's @Arun! Welcome back bud! – Ricardo Saporta Oct 03 '13 at 23:07
  • 1
    @RicardoSaporta, on vacation.. managed to sneak in for a while.. :) – Arun Oct 03 '13 at 23:11
  • haha, Arun, you're an addict – Ricardo Saporta Oct 03 '13 at 23:13

3 Answers


Update: A (much) faster solution would be to play with the indices as follows (takes about 4 seconds on a 1e6-row, 7-column table):

ll <- vector("list", 3)            # original block + 2 shifted blocks
ll[[1]] <- copy(dt[, -1])          # unshifted values (drop the name column)
d_idx <- seq(2, ncol(dt), by=3)    # position of the first column of each group
for (j in 1:2) {                   # j = number of positions to left-shift
    tmp <- vector("list", length(colGroups))
    for (i in seq_along(colGroups)) {
        idx <- ((i-1)*3+2):((i*3)+1)                   # all columns of group i
        cols <- setdiff(idx, d_idx[i]:(d_idx[i]+j-1))  # drop the first j of them
        # ..cols means "look up cols one level up" (in the calling scope)
        tmp[[i]] <- cbind(dt[, ..cols], data.table(matrix(0, ncol=j)))
    }
    ll[[j+1]] <- do.call(cbind, tmp)
}
ans <- cbind(data.table(name=dt$name), rbindlist(ll))
setkey(ans, name)
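To trace which columns the inner loop keeps at each step, the index arithmetic can be run on its own in base R (assuming the question's layout: column 1 is `name`, followed by two groups of three columns):

```r
d_idx <- seq(2, 7, by = 3)  # first column of each group: 2 ("a1") and 5 ("b1")
kept <- list()
for (j in 1:2) {            # shift amount
  for (i in 1:2) {          # group number
    idx <- ((i - 1) * 3 + 2):((i * 3) + 1)      # all columns of group i
    kept[[paste0("j", j, "g", i)]] <-
      setdiff(idx, d_idx[i]:(d_idx[i] + j - 1)) # drop the first j columns
  }
}
kept
# $j1g1 = 3 4 (a2, a3); $j1g2 = 6 7 (b2, b3); $j2g1 = 4 (a3); $j2g2 = 7 (b3)
```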

First attempt (old): Very interesting problem. I'd approach it using melt.data.table and dcast.data.table (from 1.8.11) as follows:

require(data.table)
require(reshape2)
# melt is S3 generic, calls melt.data.table, returns a data.table (very fast)
ans <- melt(dt, id=1, measure=2:7, variable.factor=FALSE)[, 
                    grp := rep(colGroups, each=nrow(dt)*3)]
setkey(ans, name, grp)
ans <- ans[, list(variable=c(variable, variable[1:(.N-1)], 
          variable[1:(.N-2)]), value=c(value, value[-1],
     value[-(1:2)]), id2=rep.int(1:3, 3:1)), list(name, grp)]
# dcast in reshape2 is not yet a S3 generic, have to call by full name
ans <- dcast.data.table(ans, name+id2~variable, fill=0L)[, id2 := NULL]

Benchmarking on 1e6 rows with same number of columns:

require(data.table)
require(reshape2)
set.seed(45)
N <- 1e6
dt <- cbind(data.table(name=paste("x", 1:N, sep="")), 
               matrix(sample(10, 6*N, TRUE), nrow=N))
setnames(dt, c("name", "a1", "a2", "a3", "b1", "b2", "b3"))
colGroups <- c("a", "b")

system.time({
ans <- melt(dt, id=1, measure=2:7, variable.factor=FALSE)[, 
                    grp := rep(colGroups, each=nrow(dt)*3)]
setkey(ans, name, grp)
ans <- ans[, list(variable=c(variable, variable[1:(.N-1)], 
          variable[1:(.N-2)]), value=c(value, value[-1],
     value[-(1:2)]), id2=rep.int(1:3, 3:1)), list(name, grp)]
ans <- dcast.data.table(ans, name+id2~variable, fill=0L)[, id2 := NULL]

})

#   user  system elapsed 
# 45.627   2.197  52.051 
MichaelChirico
Arun
  • Works fine here, just tried ... did you load dt and colGroups – xbsd Oct 04 '13 at 00:52
  • @Arun - thanks, it is very fast indeed. My actual table has ~ 40 groups and 9 columns per group, will try it out on the same. – xbsd Oct 04 '13 at 00:55
  • I was testing on 1.8.10, so I was getting `ans` as a d.f instead of d.t – Ricardo Saporta Oct 04 '13 at 00:56
  • ah i c. Just to add some more context (lest this seems like an odd exercise), the operation above is required to prepare a time-series table with lagged predictors to measure seasonal variations using ML algorithms (the colGroups are time intervals). Thanks for the help. – xbsd Oct 04 '13 at 01:04
  • @Arun - Yes, it's a lot faster. The performance on 1e6 is about 2 seconds. 1e6*7 is ~ 26secs. Nonetheless, about a 90%+ improvement over the code in the First Attempt. Very nice. – xbsd Oct 04 '13 at 11:18
  • Prepared a Parameterized Version of this to handle my dataset, given below as "Answer" – xbsd Oct 04 '13 at 14:36
  • A follow up on the above -- I ran it on my original dataset and it took several hours (overnight), which is understandable given the size of the dataset - the dataset was reduced to ~ 400K rows with approx 361 columns (1 + 40 groups X 9 columns per group). To reduce the time - I was thinking if the loops could be run using foreach %dopar% - that should save a considerable amount of time ... . It might not be linear with the number of cores, but any reduction over x hours would be useful. – xbsd Oct 07 '13 at 03:54
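The `%dopar%` idea in the comment above can be sketched by chunking: split the unique names into chunks, process each chunk independently, and bind the results. Below is a self-contained sketch using base R's `parallel` package instead of `foreach` (the worker `process_chunk` is a hypothetical stand-in for the shifting code in the answer, not a real function here):

```r
library(parallel)

# Split a vector of names into n roughly equal chunks, preserving order
split_names <- function(nms, n_chunks) {
  split(nms, cut(seq_along(nms), n_chunks, labels = FALSE))
}

# Hypothetical usage, with process_chunk = the index-shifting code wrapped
# in a function. (mclapply forks, so on Windows use parLapply or
# foreach/doParallel instead.)
# chunks <- split_names(unique(dt$name), detectCores())
# pieces <- mclapply(chunks, function(nm) process_chunk(dt[name %in% nm]),
#                    mc.cores = detectCores())
# ans <- rbindlist(pieces)
# setkey(ans, name)
```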

You can append the extra rows first and then shift the columns up within each group. Since the number of columns per group is fixed, you can iterate over the shift offsets.

## Add in the extra rows
dt <- dt[, rbindlist(rep(list(.SD), 3)), by=name]


### ASSUMING A FIXED NUMBER PER COLGROUP
N <- 3

colsShifting <- as.vector(sapply(colGroups, paste0, 2:N))

for (i in (2:N)-1 ) {
  current <- colsShifting[ (i) +  ( (N-1) * (seq_along(colGroups)-1) )]
  dt[, c(current) := {
              .NN <- .N; 
              .CROP <- .SD[1:(.NN-i)]  ## These lines are only for clean code. You can put it all into the `rbindlist` line
              rbindlist(list(.CROP, as.data.table(replicate(ncol(.SD), rep(0, i),simplify=FALSE ))))
            } 
      , .SDcols=current
      , by=name]
  }

which gives:

dt
#     name a1 a2 a3 b1 b2 b3
#  1: john  1  2  3 10 20 30
#  2: john  1  2  0 10 20  0
#  3: john  1  0  0 10  0  0
#  4: jill  4  5  6 40 50 60
#  5: jill  4  5  0 40 50  0
#  6: jill  4  0  0 40  0  0
Ricardo Saporta

Just an edit of @Arun's code from the chosen answer. Providing it here as I cannot post it in the comments section.

# Parameterized version of @Arun's code (in the selected answer)

#Shifting Columns in R
#--------------------------------------------
N = 5  # SET - Number of unique names
set.seed(5)
colGroups <- c("a","b") # ... (i) # SET colGroups
totalColsPerGroup <- 10 # SET Cols Per Group
numColsToLeftShift <- 8 # SET Cols to Shift

lenColGroups <- length(colGroups) # ... (ii)

# From (i) and (ii)
totalCols = lenColGroups * totalColsPerGroup


dt <- cbind(data.table(name=paste("x", 1:N, sep="")), 
            matrix(sample(5, totalCols*N, TRUE), nrow=N)) # Change 5 if needed

ll <- vector("list", numColsToLeftShift + 1)  # original block + one per shift
ll[[1]] <- copy(dt[, -1, with=FALSE])
d_idx <- seq(2, ncol(dt), by=totalColsPerGroup)
for (j in 1:(numColsToLeftShift)) {
  tmp <- vector("list", length(colGroups))
  for (i in seq_along(colGroups)) {
    idx <- ((i-1)*totalColsPerGroup+2):((i*totalColsPerGroup)+1) #OK
    tmp[[i]] <- cbind(dt[, setdiff(idx, d_idx[i]:(d_idx[i]+j-1)), 
                         with=FALSE], data.table(matrix(0, ncol=j)))

  }      
  ll[[j+1]] <- do.call(cbind, tmp)

}
ans <- cbind(data.table(name=dt$name), rbindlist(ll))
setkey(ans, name)
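As a closing illustration (not from any of the answers above), the whole transform for the toy example can also be done with a plain matrix slice per group. This is a base-R sketch, useful mainly as a correctness check against the data.table versions:

```r
df <- data.frame(name = c("john", "jill"),
                 a1 = c(1, 4), a2 = c(2, 5), a3 = c(3, 6),
                 b1 = c(10, 40), b2 = c(20, 50), b3 = c(30, 60),
                 stringsAsFactors = FALSE)
groups <- list(a = c("a1", "a2", "a3"), b = c("b1", "b2", "b3"))

# Return a copy of df with every column group left-shifted by j positions
shift_block <- function(d, j) {
  for (cols in groups) {
    m <- as.matrix(d[cols])
    if (j > 0)
      m <- cbind(m[, -seq_len(j), drop = FALSE], matrix(0, nrow(m), j))
    d[cols] <- m
  }
  d
}

blocks <- lapply(0:2, shift_block, d = df)  # original + shift-1 + shift-2
ans <- do.call(rbind, blocks)
ans <- ans[order(ans$name), ]               # group the rows by name
```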


xbsd