Given the following data.table dt:
    i a  b
 1: 1 1 NA
 2: 2 1 NA
 3: 2 2  2
 4: 3 1  1
 5: 3 2  2
 6: 3 3 NA
 7: 4 1 NA
 8: 4 2  2
 9: 4 3  3
10: 4 4 NA
I want to calculate a running variance on columns a and b, grouped by column i, using Welford's method and the facilities of the RStorm package. I followed the example on page 4 of RStorm's vignette and read through an introductory paper on RStorm, but I'm unable to figure out how to make it work. Here's my code:
library(RStorm)
library(data.table)
dt = data.table(i=c(1,2,2,3,3,3,4,4,4,4), a=c(1,1:2,1:3,1:4), b=c(NA,NA,2,1,2,NA,NA,2,3,NA))
in_cols = c('a','b')
out_cols <- paste0(in_cols, '.var.Welford')
## Calculating variance using Welford's method
## See: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
## See: "RStorm: Developing and Testing Streaming Algorithms in R", R Journal Vol 6/1
var.Welford <- function(x, ...) {
  x <- as.numeric(x[1])
  ## Running state (n, M, S) is kept in the RStorm hash between tuples
  params <- GetHash("params2")
  if (!is.data.frame(params)) {
    params <- list()
    params$M <- params$S <- params$n <- 0
  }
  ## An NA observation is replaced by the current running mean
  x <- ifelse(is.na(x), params$M, x)
  n <- params$n + 1
  delta <- (x - params$M)
  ## Note: the textbook recurrence divides delta by n, not n + 1
  M <- params$M + ( delta / (n + 1) )
  S <- params$S + delta*(x - M)
  SetHash("params2", data.frame(n=n,M=M,S=S))
  var <- ifelse(n > 1, S / (n-1), 0)
  TrackRow("var.Welford", data.frame(var = var))
}
computeVarWelford <- function(x) {
  ## One tuple per observation; the single column is named "x"
  topology <- Topology(as.data.frame(x))
  topology <- AddBolt(topology, Bolt(var.Welford, listen = 0))
  result <- RStorm(topology)
  # GetTrack('var.Welford', result)
  result$track$var.Welford
}
## Execute:
dt[, eval(out_cols) := lapply(.SD, function(x) {return(as.list(computeVarWelford(x))[1])})
, by=i, .SDcols = in_cols]
Executing the line above transforms dt into:
    i a  b                       a.var.Welford                       b.var.Welford
 1: 1 1 NA                                   0                                   0
 2: 2 1 NA                                 0,2                   0.000000,2.666667
 3: 2 2  2                                 0,2                   0.000000,2.666667
 4: 3 1  1                         0.0,2.0,2.5                               0,2,1
 5: 3 2  2                         0.0,2.0,2.5                               0,2,1
 6: 3 3 NA                         0.0,2.0,2.5                               0,2,1
 7: 4 1 NA 0.000000,2.000000,2.500000,3.333333 0.000000,2.666667,3.375000,2.250000
 8: 4 2  2 0.000000,2.000000,2.500000,3.333333 0.000000,2.666667,3.375000,2.250000
 9: 4 3  3 0.000000,2.000000,2.500000,3.333333 0.000000,2.666667,3.375000,2.250000
10: 4 4 NA 0.000000,2.000000,2.500000,3.333333 0.000000,2.666667,3.375000,2.250000
It's pretty clear from the results that the entire list of variances for each (column, group) pair is being copied into every row of that pair, instead of each variance being assigned to its own row. This is what I really want:
    i a  b a.var.Welford b.var.Welford
 1: 1 1 NA             0             0
 2: 2 1 NA             0             0
 3: 2 2  2             2      2.666667
 4: 3 1  1           0.0             0
 5: 3 2  2           2.0             2
 6: 3 3 NA           2.5             1
 7: 4 1 NA      0.000000      0.000000
 8: 4 2  2      2.000000      2.666667
 9: 4 3  3      2.500000      3.375000
10: 4 4 NA      3.333333      2.250000
I'm really hoping there is a simple fix for this, but I haven't been able to figure it out for the life of me. Every time I try what I think should work, I get an error from data.table saying:
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
which I understand to mean that the dimensions of the return value of whatever FUN I try in my lapply(.SD, FUN) code don't correspond to the dimensions that data.table expects for a column in that group.
Any and all help is much appreciated.
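For cross-checking the numbers independently of RStorm, here is a minimal plain-R sketch of the textbook Welford recurrence as described on the Wikipedia page linked in the code comments. The helper name running_var_welford is hypothetical and not part of any package, and NA handling is left out for brevity. Note that the textbook update divides delta by n, whereas var.Welford above divides by n + 1, so the two will not produce the same numbers.

```r
## Plain-R sketch of the textbook Welford recurrence (no RStorm involved).
## running_var_welford is a hypothetical helper name used for illustration.
running_var_welford <- function(x) {
  n <- 0   # number of observations seen so far
  M <- 0   # running mean
  S <- 0   # running sum of squared deviations from the mean
  out <- numeric(length(x))
  for (k in seq_along(x)) {
    n <- n + 1
    delta <- x[k] - M
    M <- M + delta / n            # textbook update: divide by n
    S <- S + delta * (x[k] - M)
    out[k] <- if (n > 1) S / (n - 1) else 0
  }
  out
}

x <- c(1, 2, 3, 4)
welford <- running_var_welford(x)

## Cross-check every prefix against stats::var (defined as 0 for one value here)
direct <- sapply(seq_along(x), function(k) if (k > 1) var(x[1:k]) else 0)
print(all.equal(welford, direct))
```

The function returns one value per input element, which is the shape that a grouped := assignment needs for each column.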
EDIT: Okay, the solution was very simple. I feel stupid. But here's the answer for anyone who may need it later:
## Make sure to use [[]] at the end. My problem came entirely down to using [].
dt[, eval(out_cols) := lapply(.SD, function(x) {return(as.list(computeVarWelford(x))[[1]])})
, by=i, .SDcols = in_cols]
This works like a charm. I got what I needed:
    i a  b a.var.Welford b.var.Welford
 1: 1 1 NA      0.000000      0.000000
 2: 2 1 NA      0.000000      0.000000
 3: 2 2  2      2.000000      2.666667
 4: 3 1  1      0.000000      0.000000
 5: 3 2  2      2.000000      2.000000
 6: 3 3 NA      2.500000      1.000000
 7: 4 1 NA      0.000000      0.000000
 8: 4 2  2      2.000000      2.666667
 9: 4 3  3      2.500000      3.375000
10: 4 4 NA      3.333333      2.250000
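The difference between [ ] and [[ ]] can be seen in base R without RStorm or data.table. The sketch below uses a hand-written list as a stand-in for what as.list(computeVarWelford(x)) returns for one group, assuming the tracked result is a one-column data frame, as the TrackRow call suggests. With [1] the value is still a list of length 1, which presumably is what got recycled into every row of the group as a list cell; with [[1]] it is the numeric vector itself, one value per row.

```r
## Stand-in (an assumption) for as.list(computeVarWelford(x)) on a 3-row group:
## a list with a single element holding the running variances.
res <- list(var = c(0, 2, 2.5))

single <- res[1]     # single bracket: still a list, of length 1
double <- res[[1]]   # double bracket: the numeric vector inside the list

print(is.list(single))     # list wrapper is kept
print(length(single))      # one element (the whole vector)
print(is.numeric(double))  # plain numeric vector
print(length(double))      # one value per row of the group
```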