
I took this article as a reference for computing the mean and variance online from a variable-length stream of data: http://www.johndcook.com/standard_deviation.html.

The data is a set of 16-bit unsigned values, which may have any number of samples (in practice, the minimum would be about 20 samples and the maximum about 2e32 samples).

As the dataset may be too big to store, I already implemented the above-mentioned online algorithm in C and verified that it computes correctly.

The trouble begins with the following requirement for the application: besides computing the variance and mean for the whole set, I also need to compute a separate result (both mean and variance) for the population comprised of the middle 50% of the values, i.e. disregarding the first 25% and the last 25% of the samples. The number of samples is not known beforehand, so I must compute the additional set online as well.

I understand that I can both add and subtract a subset by computing it separately and then using something like the operator+ implementation described here: http://www.johndcook.com/skewness_kurtosis.html (minus the skewness and kurtosis specifics, for which I have no use). The subtraction could be derived from this.

The problem is: how do I maintain these subsets? Or should I try another technique?

  • One problem is that even knowing where the 25th and 75th percentiles are is pretty hard/complicated. AFAIK, there aren't simple online algorithms for quantile estimation that have nice guarantees. – Rob Neuhaus Jul 28 '14 at 20:09

1 Answer


If space is an issue, and you'd be happy to accept an approximation, I'd start with the algorithm from the following paper:

M Greenwald, S Khanna, Space-Efficient Online Computation of Quantile Summaries

You can use the algorithm to compute running estimates of the 25th and 75th percentiles of the observations seen so far. You can then feed those observations that fall between the two percentiles into the Welford algorithm covered in John D. Cook's article to compute the running mean and variance.

NPE
  • An approximation would suffice, but I didn't get how to do the feeding (the final part of your explanation). – Alexandre Pereira Nunes Jul 28 '14 at 23:39
  • @AlexandrePereiraNunes: You drop everything less than the 25th percentile and greater than the 75th percentile, and compute the running mean and standard deviation of everything else. – NPE Jul 29 '14 at 05:42
  • what I need is to compute the second and third quarters of all the acquired samples, i.e. if I captured 1000 samples, I need to compute the mean and stddev of all samples between 250 and 750. I can't store enough samples (assume 1000 could end up being 1e32) to compute this in a second pass. What I understand of what you're suggesting is that I would end up computing the middle 50% of most samples, even those at the beginning and end of the whole set, and these are the ones I would want to disregard. Not sure if I made myself clear or if I understood what you said. – Alexandre Pereira Nunes Jul 29 '14 at 17:31
  • @AlexandrePereiraNunes: As I said, this is an approximation. If you need the exact answer, you'll need to store the entire set. For this, you can make use of the fact that your data can only contain 65,536 distinct values, and just count how many of each you've seen, a la counting sort (http://en.wikipedia.org/wiki/Counting_sort). – NPE Jul 29 '14 at 20:08
  • I think an approximation would suffice, I just didn't get your thought: were you suggesting that I take an equivalent number of percentiles in my area of interest and compute the (estimated) average and std dev on them? If that was it, then thanks, because I suppose this would be sufficient, at least as long as I can control the error, as the paper you pointed out suggests. – Alexandre Pereira Nunes Jul 30 '14 at 13:07
  • Also, thanks for mentioning the counting sort. A glimpse of this idea had crossed my mind, but I didn't think it was worth considering. The interesting fact is that in a future application on the same project I was going to need the full histogram too, and this solution resolves all the problems at once. – Alexandre Pereira Nunes Jul 30 '14 at 13:19