Algorithm for Distributed Standard Deviation

Question

I currently have an algorithm for finding the standard deviation on a cluster of machines: one node requests the whole data set from the other nodes across the network and runs the standard deviation calculation once all of the data has been received.

Instead, I would like to process the data independently on each node and send each partial result to the requesting node, which will merge them. This would reduce network traffic and let the calculation run in parallel.

Is there an algorithm that can do this, or do all standard deviation calculations need access to the whole data set at once?

score 1 · Answer 1 · answered Jun 01 '17 at 16:49

If s1 and s2 are the standard deviations of two data sets with n1 and n2 samples and means y1 and y2, then to merge s1 and s2 into the combined standard deviation s, the formula is:
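One standard form of this merge, written with the n1, y1 and y notation used in the comments on this answer, is sketched here. It assumes s1 and s2 are sample standard deviations (n - 1 denominator); for population standard deviations the factors n1 - 1, n2 - 1 and n1 + n2 - 1 become n1, n2 and n1 + n2.

```latex
y = \frac{n_1 y_1 + n_2 y_2}{n_1 + n_2}
\qquad
s = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2
              + n_1 (y_1 - y)^2 + n_2 (y_2 - y)^2}
             {n_1 + n_2 - 1}}
```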

So you need to transmit the mean, standard deviation and number of samples from each machine over the network. I couldn't write LaTeX on Stack Overflow, so I posted the formulas as images instead. You can read more on the Wikipedia page.
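A minimal Python sketch of that exchange, assuming each node sends a (count, mean, sample standard deviation) triple and holds at least two samples. The names `summarize`, `merge_sd`, `node_a` and `node_b` are hypothetical and only illustrate the idea.

```python
import math
import statistics

def summarize(data):
    """Runs on each node: reduce local data to (count, mean, sample sd)."""
    return len(data), statistics.fmean(data), statistics.stdev(data)

def merge_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Runs on the requesting node: merge two summaries into one."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Within-group sums of squares plus the between-group correction terms.
    ss = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
          + n1 * (mean1 - mean) ** 2 + n2 * (mean2 - mean) ** 2)
    return n, mean, math.sqrt(ss / (n - 1))

# Hypothetical data held by two nodes.
node_a = [1.0, 2.0, 3.0, 4.0]
node_b = [10.0, 12.0, 14.0]
n, mean, sd = merge_sd(*summarize(node_a), *summarize(node_b))
print(n, mean, sd)
```

Because the merged result is again a (count, mean, standard deviation) triple, it can be merged with the summary from a third node in the same way.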

Kaidul


When I worked this out by hand I got `n1(y1^2 - y^2) + n2(y2^2 - y^2)` instead of `n1(y1 - y)^2 + n2(y2 - y)^2`, is that correct? – SirGuy Jun 01 '17 at 18:20
Never mind, I checked with a Python script and they are both correct. – SirGuy Jun 01 '17 at 18:32

score 0 · Answer 2 · answered Jun 01 '17 at 16:46

You could have each node compute sum_i (the sum of its values), sum_squared_i (the sum of their squares) and count_i (the number of values) for the data it holds, and then merge the results as:

totalSum = Sum(sum_i)
totalSumSquared = Sum(sum_squared_i)
totalCount = Sum(count_i)

mean = totalSum / totalCount
variance = (totalSumSquared - mean * totalSum) / (totalCount - 1)
sd = sqrt(variance)

Here Sum(x_i) means the sum of x_i over all the nodes.
This algorithm can suffer from precision loss due to cancellation when the two terms in the variance numerator are large and nearly equal, so you might prefer to adapt any of the other algorithms from here instead.
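One way to sidestep the cancellation problem is sketched below in Python. It is not taken from this answer: it swaps in Welford's online update on each node and the pairwise merge described by Chan et al., so each node sends (count, mean, M2), where M2 is the sum of squared deviations from that node's own mean. The function names and sample data are hypothetical.

```python
import math
from functools import reduce

def summarize(data):
    """Per-node pass: return (count, mean, M2) using Welford's update."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in data:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the already-updated mean
    return count, mean, m2

def merge(a, b):
    """Combine two (count, mean, M2) summaries (Chan et al. pairwise merge)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

def sample_sd(summary):
    """Sample standard deviation (n - 1 denominator); needs at least 2 samples."""
    n, _, m2 = summary
    return math.sqrt(m2 / (n - 1))

# Hypothetical data split across three nodes.
parts = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
total = reduce(merge, (summarize(p) for p in parts))
print(sample_sd(total))
```

The merge is associative, so summaries can be combined in any order, for example up a tree of nodes rather than all at the requester.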
