
How can I use data.table syntax to produce a data.table where each column contains the differences between each column of the original data.table and the next column?

Example: I have a data.table where each row is a group, and each column is surviving population after year 0, after year 1, 2, etc. Such as:

library(data.table)

pop <- data.table(group_id = c(1, 2, 3),
                  N = c(4588L, 4589L, 4589L),
                  N_surv_1 = c(4213, 4243, 4264),
                  N_surv_2 = c(3703, 3766, 3820),
                  N_surv_3 = c(2953, 3054, 3159))
# group_id    N N_surv_1 N_surv_2 N_surv_3
#        1 4588     4213     3703     2953
#        2 4589     4243     3766     3054
#        3 4589     4264     3820     3159

(Data types differ because N is a true integer count and N_surv_1, etc. are projections that could be fractional.)

What I have done: using base diff and matrix transposition, I can do the following:

# note: naming the result `diff` masks base::diff in this scope
diff <- data.table(t(diff(t(as.matrix(pop[, -1, with = FALSE])))))
setnames(diff, paste0("deaths_", 1:ncol(diff)))
cbind(group_id = pop[, group_id], diff)
# produces desired output:
#    group_id deaths_1 deaths_2 deaths_3
#           1     -375     -510     -750
#           2     -346     -477     -712
#           3     -325     -444     -661

I know that I can use base diff by group on the single column produced by melt.data.table, so the following works, but it ain't pretty:

melt(pop, 
     id.vars = "group_id"
     )[order(group_id)][, setNames(as.list(diff(value)),
                                   paste0("deaths_",1:(ncol(pop)-2)) ),
                          keyby = group_id]

Is that the most data.table-riffic way to do this, or is there a way to do it as a multi-column operation in data.table?

C8H10N4O2

3 Answers


Well, you could subtract the subsets:

ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
  `-`,
  utils:::tail.default(.SD, -1),  # .SD without its first column
  utils:::head.default(.SD, -1)   # .SD without its last column
), .SDcols=ncols]

#    N_surv_1 N_surv_2 N_surv_3
# 1:     -375     -510     -750
# 2:     -346     -477     -712
# 3:     -325     -444     -661

You could assign these values to new columns with :=. I have no idea why tail and head are not made more easily available... As pointed out by @akrun, you could use with=FALSE instead, like pop[, .SD[, -1, with=FALSE] - .SD[, -ncol(.SD), with=FALSE], .SDcols=ncols].
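For instance, a minimal sketch of that := assignment (reusing ncols from above; the deaths_* names are my own choice, not part of the answer):

deaths = paste0("deaths_", seq_len(length(ncols) - 1))
pop[, (deaths) := Map(
  `-`,
  utils:::tail.default(.SD, -1),
  utils:::head.default(.SD, -1)
), .SDcols = ncols]

This adds deaths_1 through deaths_3 to pop by reference and leaves the original columns in place.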

Anyway, this is pretty convoluted compared to simply reshaping:

melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]
#    group_id   V1
# 1:        1 -375
# 2:        1 -510
# 3:        1 -750
# 4:        2 -346
# 5:        2 -477
# 6:        2 -712
# 7:        3 -325
# 8:        3 -444
# 9:        3 -661
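If you want the wide layout back after reshaping, one option (my addition, not part of the answer) is to name the differences in j and then dcast:

long = melt(pop, id = "group_id")[
  , .(period = paste0("deaths_", seq_len(.N - 1L)),
      deaths = tail(value, -1) - head(value, -1)),
  by = group_id]
dcast(long, group_id ~ period, value.var = "deaths")

This restores the one-row-per-group layout shown in the question.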
Frank

Without reshaping the data: since each row has a unique id, you can group by the id column and then calculate the differences across each row with diff on unlist(.SD):

pop[, setNames(as.list(diff(unlist(.SD))), paste0("deaths_", 1:(ncol(pop)-2))), group_id]

#    group_id deaths_1 deaths_2 deaths_3
# 1:        1     -375     -510     -750
# 2:        2     -346     -477     -712
# 3:        3     -325     -444     -661

Essentially, something like this if you ignore setting up the column names:

pop[, as.list(diff(unlist(.SD))), group_id]
Psidom

Here's another way to do it without reshaping or grouping, which might make it faster. With a small number of rows there probably won't be a noticeable difference.

cols <- names(pop)[-1]

# build adjacent column pairs: c("N","N_surv_1"), c("N_surv_1","N_surv_2"), ...
combs <- list()
for (i in 2:length(cols)) {
  combs[[length(combs) + 1]] <- c(cols[i - 1], cols[i])
}

# name each difference column after the later column of its pair
newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))

deathpop <- copy(pop)
deathpop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
deathpop[, (cols) := NULL]

I did some benchmarking:

rows<-10000000
pop <- data.table(group_id = 1:rows, 
                  N = runif(rows,3000,4000), 
                  N_surv_1 = runif(rows,3000,4000), 
                  N_surv_2 = runif(rows,3000,4000), 
                  N_surv_3 = runif(rows,3000,4000))
system.time({
    cols<-names(pop)[-1]
    combs<-list()
    for(i in 2:length(cols)) {
      combs[[length(combs)+1]]<-c(cols[i-1], cols[i])
    }
    newnames<-sapply(combs,function(x) gsub('N_surv','death',x[2]))
    deathpop<-copy(pop)
    deathpop[,(newnames):=lapply(combs,function(x) get(x[2])-get(x[1]))]
    deathpop[,(cols):=NULL]})

and it returned

user  system elapsed 
0.192   0.808   1.003 

In contrast I did

system.time(pop[, as.list(diff(unlist(.SD))), group_id])

and it returned

   user  system elapsed 
169.836   0.428 170.469 

I also did

system.time({
  ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
  pop[, Map(
    `-`, 
    utils:::tail.default(.SD, -1), 
    utils:::head.default(.SD, -1)
  ), .SDcols=ncols]
})

which returned

 user  system elapsed 
0.044   0.044   0.089 

Finally, doing

system.time(melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id])

returns

   user  system elapsed 
223.360   1.736 225.315 

Frank's Map solution is the fastest. If you take the copying out of mine, it gets a lot closer to Frank's time, but his still wins for this test case.
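For reference, a rough sketch of that no-copy variant (reusing cols, combs and newnames from the code above; it modifies pop in place, and I haven't re-timed it here):

pop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
pop[, (cols) := NULL]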

Dean MacGregor
  • Interesting that melt is slow, but I guess not too surprising. I'd still advocate it as the "right" way to store the data. That is, it should be long form with some "time" column capturing the sequence of population measurements for each group. I'm guessing akrun's solution (found in my answer: `pop[, .SD[, -1, with=FALSE] - .SD[, -ncol(.SD), with=FALSE], .SDcols=ncols]`) is also fairly fast. – Frank Aug 17 '16 at 18:48