I currently have an algorithm for computing the standard deviation on a cluster of machines: one node requests the whole data set from the other nodes across the network and runs the standard deviation calculation once all the data has been received.

What I would like is to process the data independently on each node, then send the result of that to the requesting node which will merge the results. This will reduce network traffic and calculate the results in parallel.

The question is whether there is an algorithm that can do this, or whether every standard deviation calculation depends on having the whole data set in one place.

Andy Till

2 Answers

If s1 and s2 are the standard deviations of the two data sets: [image: definitions of s1 and s2]

To merge s1 and s2 into the combined standard deviation s, the formula is: [image: formula for the combined standard deviation s]
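For reference, the usual textbook form of this merge is written out below. It assumes sample standard deviations (the n − 1 denominator), which is a guess about what the image showed, though it matches the `n1(y1 - y)^2 + n2(y2 - y)^2` term quoted in the comments. Here n1, n2 are the sample counts, y1, y2 the per-set means and y the combined mean:

```latex
\bar{y} = \frac{n_1 \bar{y}_1 + n_2 \bar{y}_2}{n_1 + n_2},
\qquad
s = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2
          + n_1 (\bar{y}_1 - \bar{y})^2 + n_2 (\bar{y}_2 - \bar{y})^2}
         {n_1 + n_2 - 1}}
```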

So you need to transmit the mean, the standard deviation and the number of samples over the network from each machine. I couldn't write LaTeX on Stack Overflow, so I posted images instead. You can read more on the Wikipedia page.
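A minimal Python sketch of that merge, assuming each machine reports a sample standard deviation (n − 1 denominator) and holds at least two values. The function name `merge_sd` and the sample data are made up for illustration:

```python
import math
import statistics


def merge_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combine two (count, mean, sample sd) summaries into one."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Within-set sums of squared deviations, plus a correction for how far
    # each set's mean lies from the combined mean.
    ss = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
          + n1 * (mean1 - mean) ** 2 + n2 * (mean2 - mean) ** 2)
    return n, mean, math.sqrt(ss / (n - 1))


# Two "machines" summarise their own data; the requester merges the summaries.
a = [2.0, 4.0, 4.0, 5.0]
b = [7.0, 9.0, 10.0]
n, mean, sd = merge_sd(len(a), statistics.mean(a), statistics.stdev(a),
                       len(b), statistics.mean(b), statistics.stdev(b))
```

Because the result is again a (count, mean, sd) triple, it can be merged with a third machine's summary in the same way.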

Kaidul
  • When I worked this out by hand I got `n1(y1^2 - y^2) + n2(y2^2 - y^2)` instead of `n1(y1 - y)^2 + n2(y2 - y)^2`, is that correct? – SirGuy Jun 01 '17 at 18:20
  • Nevermind, I checked with a python script and they are both correct. – SirGuy Jun 01 '17 at 18:32

You could have each node compute sum_i, sum_squared_i and count_i (the sum, the sum of squares and the number of values) for its own data, then merge the results as:

totalSum = Sum(sum_i)
totalSumSquared = Sum(sum_squared_i)
totalCount = Sum(count_i)

mean = totalSum / totalCount
variance = (totalSumSquared - totalSum * mean) / (totalCount - 1)
sd = sqrt(variance)

Where the Sum(x_i) means the sum over all the nodes' computed x_i.
This algorithm can lose precision through cancellation when the mean is large compared with the spread of the data, so you might prefer to adapt one of the other algorithms from here instead.
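One commonly used alternative is sketched below, on the assumption that it is among the algorithms the link refers to. Each node runs Welford's one-pass update to get (count, mean, M2), where M2 is the sum of squared deviations from the node's own mean, and the requester combines the summaries with the pairwise merge attributed to Chan et al. The names `summarize`, `merge` and `sample_sd` and the data are illustrative:

```python
import math
import statistics
from functools import reduce


def summarize(values):
    """Per-node pass (Welford's update): returns (count, mean, M2)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2


def merge(a, b):
    """Combine two (count, mean, M2) summaries without revisiting the data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def sample_sd(summary):
    """Sample standard deviation; needs at least two values overall."""
    count, _, m2 = summary
    return math.sqrt(m2 / (count - 1))


# Each inner list stands in for the data held by one node.
node_data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
total = reduce(merge, (summarize(values) for values in node_data))
sd = sample_sd(total)
```

Only three numbers per node cross the network, the same as with the sums above, but the squared terms are taken around each node's mean rather than around zero, which avoids subtracting two large, nearly equal totals.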

SirGuy