Computing Welford's variance for multiple data.table columns using RStorm

Question

Given the following data.table dt:

I want to calculate a running variance on columns a and b grouped by column i using Welford's Method and the RStorm package facilities. I followed along the example on page 4 of RStorm's vignette and read through an introductory paper on RStorm, but I'm unable to figure out how to make it work. Here's my code:

library(RStorm)
dt = data.table(i=c(1,2,2,3,3,3,4,4,4,4), a=c(1,1:2,1:3,1:4), b=c(NA,NA,2,1,2,NA,NA,2,3,NA)
in_cols = c('a','b')
out_cols <- paste0(in_cols, '.var.Welford')
## Calculaing variance using Welford's method
## See: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
## See: "RStorm: Developing and Testing Streaming Algorithms in R", R Journal Vol 6/1
var.Welford <- function(x, ...) {
    x <- as.numeric(x[1])
    params <- GetHash("params2")
    if (!is.data.frame(params)) {
        params <- list()
        params$M <- params$S <- params$n <- 0
    }
    x <- ifelse(is.na(x), params$M, x)
    n <- params$n + 1
    delta <- (x - params$M)
    M <- params$M + ( delta / (n + 1) )
    S <- params$S + delta*(x - M)
    SetHash("params2", data.frame(n=n,M=M,S=S))
    var <- ifelse(n > 1, S / (n-1), 0)
    TrackRow("var.Welford", data.frame(var = var))
}
computeVarWelford <- function(x) {
    topology <- Topology(as.data.frame(x=as.data.frame(x)))
    topology <- AddBolt(topology, Bolt(var.Welford, listen = 0))
    result <- RStorm(topology)
    # GetTrack('var.Welford', result)
    result$track$var.Welford
}

## Execute:
dt[, eval(out_cols) := lapply(.SD, function(x) {return(as.list(computeVarWelford(x))[1])})
, by=i, .SDcols = in_cols]

Executing the line above transforms dt into:

    i a  b                       a.var.Welford                       b.var.Welford
 1: 1 1 NA                                   0                                   0
 2: 2 1 NA                                 0,2                   0.000000,2.666667
 3: 2 2  2                                 0,2                   0.000000,2.666667
 4: 3 1  1                         0.0,2.0,2.5                               0,2,1
 5: 3 2  2                         0.0,2.0,2.5                               0,2,1
 6: 3 3 NA                         0.0,2.0,2.5                               0,2,1
 7: 4 1 NA 0.000000,2.000000,2.500000,3.333333 0.000000,2.666667,3.375000,2.250000
 8: 4 2  2 0.000000,2.000000,2.500000,3.333333 0.000000,2.666667,3.375000,2.250000
 9: 4 3  3 0.000000,2.000000,2.500000,3.333333 0.000000,2.666667,3.375000,2.250000
10: 4 4 NA 0.000000,2.000000,2.500000,3.333333 0.000000,2.666667,3.375000,2.250000

It's pretty clear from the results that the entire list of variances for each (column,group) pair is being copied into each element of that (column,group) pair, instead of being mapped to all elements of that (column,group) pair. This is what I really want:

    i a  b     a.var.Welford        b.var.Welford
 1: 1 1 NA     0                    0
 2: 2 1 NA     0                    0
 3: 2 2  2     2                    2.666667
 4: 3 1  1     0.0                  0
 5: 3 2  2     2.0                  2
 6: 3 3 NA     2.5                  1
 7: 4 1 NA     0.000000             0.000000
 8: 4 2  2     2.000000             2.666667
 9: 4 3  3     2.500000             3.375000
10: 4 4 NA     3.333333             2.250000

I'm really hoping there is a simple fix for this, but I haven't been able to figure it out for the life of me. Every time I try what I think should work, I end up getting an error from data.table saying

All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.

which I understand to mean that the dimensions of the return value of whatever FUN I try in my lapply(.SD, FUN) code doesn't correspond to the dimensions the data.table expects for a column for that group.

Any and all help is much appreciated.

EDIT : Okay the solution was very simple. I feel stupid. But here's the answer for who may need it later

## Make sure to use [[]] at the end. My problem came entirely down to using [].
dt[, eval(out_cols) := lapply(.SD, function(x) {return(as.list(computeVarWelford(x))[[1]])})
   , by=i, .SDcols = in_cols]

This works like a charm. I got what I needed:

    i a  b a.var.Welford b.var.Welford
 1: 1 1 NA      0.000000      0.000000
 2: 2 1 NA      0.000000      0.000000
 3: 2 2  2      2.000000      2.666667
 4: 3 1  1      0.000000      0.000000
 5: 3 2  2      2.000000      2.000000
 6: 3 3 NA      2.500000      1.000000
 7: 4 1 NA      0.000000      0.000000
 8: 4 2  2      2.000000      2.666667
 9: 4 3  3      2.500000      3.375000
10: 4 4 NA      3.333333      2.250000

@VeerendraGadekar : That will give me k^2 rows for each group with k rows originally. `cSplit` will split each row. I need to split just the first row from each group, but then also place the split values into remaining rows of that group. — Synergist, Dec 31 '15 at 20:33
or you could use `rbindlist` like this `rbindlist(lapply(split(data, data$i), function(x){cSplit(x[1,], c('a.var.Welford', 'b.var.Welford'), ',', 'long')}))` — Veerendra Gadekar, Dec 31 '15 at 21:41
@VeerendraGadekar Yes, it's a `0`. I fixed the typo. Both your code statements above work....almost. I added the results as an edit in my question above. How can I get rid of the weird parsing/eval errors? — Synergist, Dec 31 '15 at 21:43
I do not get what you showed there but you can get rid of that by using `listCol_l(output, c('a.var.Welford', 'b.var.Welford'))`, where output would be the result of above command check [this](http://stackoverflow.com/questions/31551036/unlisting-columns-by-groups/31551332#31551332) for more — Veerendra Gadekar, Dec 31 '15 at 21:48
@VeerendraGadekar Thanks for your effort. I was able to fix my problem upstream, so I no longer need to parse those weird list results per row. — Synergist, Dec 31 '15 at 22:09

Synergist · Accepted Answer · 2015-12-31T22:08:18.750

EDIT: I no longer need the hack solution below. Here's the code to solve this problem (note the [[]] instead of [] fix):

dt[, eval(out_cols) := lapply(.SD, function(x) {return(as.list(computeVarWelford(x))[[1]])})
   , by=i, .SDcols = in_cols]

OLD: Okay, so I figured out a way to make it work finally. But I find the way pretty ugly. I'm going to accept this as my answer for now, but please if anybody has a better solution, I would love to hear it and will accept it as the answer to this question if it's better than mine.

Solution:

out_cols_fixed <- paste0(out_cols, '.fixed')
dt[,eval(out_cols_fixed) := lapply(.SD, function(x) { return(x[1][[1]]) }), by=i, .SDcols = out_cols]
dt[,eval(out_cols) := NULL]
setnames(dt, old = out_cols_fixed, new = out_cols)

Result for dt as desired:

    i a  b a.var.Welford b.var.Welford
 1: 1 1 NA      0.000000      0.000000
 2: 2 1 NA      0.000000      0.000000
 3: 2 2  2      2.000000      2.666667
 4: 3 1  1      0.000000      0.000000
 5: 3 2  2      2.000000      2.000000
 6: 3 3 NA      2.500000      1.000000
 7: 4 1 NA      0.000000      0.000000
 8: 4 2  2      2.000000      2.666667
 9: 4 3  3      2.500000      3.375000
10: 4 4 NA      3.333333      2.250000

I tried the following first, but it didn't work. Can anyone explain why?

dt[,eval(out_cols) := lapply(.SD, function(x) { return(x[1][[1]]) }), by=i, .SDcols = out_cols]

I get the following error from running the line above:

Error in [.data.table(dt, , :=(eval(out_cols), lapply(.SD, function(x) { : Type of RHS ('double') must match LHS ('list'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

Computing Welford's variance for multiple data.table columns using RStorm

1 Answers1