
Is there a formula to calculate the standard deviation based on the sum or difference of other data sets?

Example:

Dataset1 (5 elements):
values: 5,10,15,20,25
mean: 15
Mean of squares: 275 = (5^2+10^2+...)/5
Population variance: 50
Population standard deviation: 7,071067812
Population Max STD (mean + STD): 22,07106781
Population Min STD (mean - STD): 7,928932188

Dataset2 (5 elements):
values: 2,4,11,7,16
mean: 8
Mean of squares: 89,2 = (2^2+4^2+...)/5
Population variance: 25,2
Population standard deviation: 5,019960159
Population Max STD (mean + STD): 13,01996016
Population Min STD (mean - STD): 2,980039841

Dataset3 (5 elements):
Each element is the sum of the corresponding elements of Dataset1 and Dataset2
values: 7,14,26,27,41
mean: 23 (<-- OK, sum of the previous means)
Mean of squares: 666,2
Population variance: 137,2
Population standard deviation: 11,71324037
Population Max STD (mean + STD): 34,71324037
Population Min STD (mean - STD): 11,28675963

The mean of Data3 is easily computed as the mean of Data1 + the mean of Data2.

But how do I calculate the other values?

For example, knowing that the sum of squares can be used to calculate the variance, is there a way to calculate the sum of squares of Data3 directly, using a formula based on Data1 and Data2?

If not, is there a way to calculate the variance of Data3 without using the covariance? (The covariance means I'll have to perform another calculation of sums.) I was thinking of a more direct formula, instead of processing each element all over again.

guga
  • 1
    Not clear what you actually want. And only tag the language you actually use. C is not C++ is not C. – too honest for this site Dec 17 '15 at 15:44
  • 1
    this is a maths question and not a programming question. the answer is no, but you can keep intermediate results that you can use to calculate the variances of the sums – 463035818_is_not_an_ai Dec 17 '15 at 15:44
  • @tobi303 With Standard Deviation, good code often keeps the intermediate results in higher precision and range in order to accommodate wide ranging values. The issues concerning this are very programming related and thus applicable to SO. So I assert this is a relevant programming question (as well as a math one). – chux - Reinstate Monica Dec 17 '15 at 15:55
  • In fact, I'm using assembly, but I want to understand how to compute the standard deviation of the sum from the results of 2 different data sets. – guga Dec 17 '15 at 16:07
  • Hi Chux. Thanks for the comments. Yes, this is a math question, which I can use in an app I'm building to calculate the correct timings of a given instruction. To achieve better accuracy I'm using different algorithms to get the correct timings (clock cycles), e.g. cpuid+rdtsc, lfence+rdtsc, rdtscp, QueryPerformanceCounter (API), etc. In total, there are 8 algorithms used to compute the "better" timing. So far, I have achieved a high rate of accuracy and found timings of around 0,621 nanosecs for "xor eax, eax", for example, which is close to the Intel measurements. The problem is with the math described – guga Dec 17 '15 at 16:12
  • I'm having trouble finding a correct formula to compute the addition or subtraction of 2 different standard deviations found on different data sets. So, instead of computing everything all over again, I'm trying to find the correct maths to do this. Otherwise the code will be a bit slow :( – guga Dec 17 '15 at 16:14
  • "Otehrwise the code will be a bit slow" Is that actually more than a theoretical issue? Quantify "slow". – too honest for this site Dec 17 '15 at 16:42
  • Does your machine perform FP quickly? Else this task should be done all in integer arithmetic. Posting the range of possible values and max count of elements is useful, especially if either is large. – chux - Reinstate Monica Dec 17 '15 at 16:50
  • In what _order_ does data arrive? First the `5,10,15,20,25` and then `7,14,26,27,41` or are they interleaved `5,7,10,14,15,26,20,27,25,41` or both available simultaneous or what? – chux - Reinstate Monica Dec 17 '15 at 17:08
  • Yep, my machine processes FP. The problem is that I wanted to sum those 2 data sets after they are created. I mean, I calculate the STD of each one of them and only after that do I use a formula to sum both results. Of course, I could simply add/subtract the results from the 1st data set to the 2nd one while it is being calculated (running), but that would make the app too slow. This is why a formula (if one exists) would do the work better. Even though the code runs fast on an i7, it will take extra seconds to finish, unnecessarily. – guga Dec 17 '15 at 17:12
  • For instance, in the current state, a range from 300 to 3000 loops (iterations) is enough to achieve a high level of accuracy. But this is only the external loop, because internally the algo loops about 3000 times until it finds a "good" timing. It identifies the "good" one by checking its stability, which is calculated by analysing the results of the standard deviation (STD). A stable timing is one where the STD is anywhere between 0 and 1, meaning that there is almost no variance from the mean. – guga Dec 17 '15 at 17:17
  • Once it is found, it collects 3000 more data points on these same results. After collecting them all, it computes another STD of those results. The final STD is the correct timing being calculated. The problem is with the fine tuning. In order to avoid overheads, stalls, etc., I need to calibrate the algo, performing a prior computation of everything before the main routines are used. This results in another STD table, which I can subtract the final result from. That's why a math formula to calculate the addition or subtraction of both STD data sets is useful here: to avoid slowdowns. – guga Dec 17 '15 at 17:22
  • Hmm... The order is purely random. I mean, I have 2 tables, each one of them containing X values. I need to take element1 from table1 and sum it with element1 from table2, etc. This results in a 3rd table with a new STD. – guga Dec 17 '15 at 17:24
  • Like this: DataSet1 values: 5,10,15,20,25; DataSet2 values: 2,4,11,7,16. Both have their own STD and mean as described. When I add both data sets I need to find the STD of the new values. Ex: NewDataSet = 5+2, 10+4, 15+11, 20+7, 25+16. I want to find the new STD, because the only thing I know for sure is that the mean of DataSet1 + the mean of DataSet2 = the mean of DataSet3. So, NewMean = Mean1+Mean2. But I want to find the other values as well, in the same way, using a formula (STD, variance, etc.). – guga Dec 17 '15 at 17:25
  • Consider set A{-2,-1,0,-1,-2}, set B{2,1,0,-1,-2}. Both have exactly the same average, STD, min, max and count. Using only those 5 facts of set information we cannot distinguish set A and B. Consider adding set A to A versus adding set B to A. The STD(A+A) is twice STD(A) and STD(A+B) is 0.0. – chux - Reinstate Monica Dec 17 '15 at 19:42
  • Hi Chux. No, they don't have the same mean. Mean of Set A = -1.2. Mean of Set B = 0. Perhaps you mistyped B = 2, 1, 0, 1, 2 (all positives), whereas Set A is all negatives (-2, -1, 0, -1, -2). But even then, the means are different: Set A = -1.2, Set B = 1.2. The rule in that case is the one I posted, i.e. the multiplicative factor rule. In this case, each element is multiplied by -1. So, if the mean of Set A = -1.2, we don't need to process the elements of Set B again; all we need to do is multiply the mean of Set A by -1, which gives a mean for Set B = 1.2 – guga Dec 19 '15 at 02:11
  • Completing the previous comment: instead of multiplying by a factor of -1, you can also add -2*X to each element, where X = the element. So, -2+(-2 * -2), -1+(-2 * -1) = -2+4, -1+2 = 2, 1. Therefore, you are adding -2 * X (now with X = OldMean). So the new mean follows the same rule as for addition, i.e. Factor+OldMean = -2*OldMean + OldMean = -2*(-1.2) + (-1.2) = 2.4 - 1.2 = 1.2. So, the mean of Set B = 1.2, which I retrieved without needing to sum all the elements of Set B all over again. All I needed to know was the mean of Set A and the factor used, to choose the proper rule. – guga Dec 19 '15 at 02:25
  • @guga My typing mistake: A{-2,-1,0,1,2} (same elements as Set B, but in the opposite order). Same mean, STD, max, min. Now consider the mean, STD, min, max of A+A and A+B. A+A: mean 0, STD ~1.41*2, min -4, max 4. A+B: mean 0, STD 0, min 0, max 0. Knowing only the mean, STD and count of A and of B, one cannot deduce the STD of A+A nor of A+B. – chux - Reinstate Monica Dec 19 '15 at 03:30
  • @guga I recommend you post your tight fast code that also sums the variance on [Code Review](http://codereview.stackexchange.com/) for performance improvement ideas. Trying to avoid the variance calculation is not leading to a solution. – chux - Reinstate Monica Dec 19 '15 at 03:30
  • Many thanks. I posted an answer to my own question below, because the comments here were getting too long. I posted an example of the code there. – guga Dec 19 '15 at 13:24

1 Answer


The standard deviation, mean, variance, etc. can be calculated by simply keeping a running tally of a few sums. The statistics of the element-wise sum of the two sets can be obtained by also keeping a running sum of the product of each pair of data points.

The calculation of STD is sensitive to subtraction (cancellation between large, nearly equal sums). Keeping sumxx, sumxy, sumyy at higher precision is recommended. The code below uses long and long long. A FP implementation could use double and long double. Higher-precision issues are not explored deeply here, other than to use the wider types for the running summations.
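As a worked version of the identity the running sums rely on, here is a short derivation for paired samples. The symbols $x_i$, $y_i$, $z_i = x_i + y_i$ and $n$ are notation introduced for this sketch; the numbers are the question's two data sets.

```latex
\begin{align*}
\sum_{i=1}^{n} z_i   &= \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i \\
\sum_{i=1}^{n} z_i^2 &= \sum_{i=1}^{n} x_i^2 + 2\sum_{i=1}^{n} x_i y_i + \sum_{i=1}^{n} y_i^2 \\
\sigma_z^2 &= \frac{1}{n}\sum_{i=1}^{n} z_i^2 - \Bigl(\frac{1}{n}\sum_{i=1}^{n} z_i\Bigr)^2
            = \sigma_x^2 + \sigma_y^2 + 2\,\operatorname{cov}(x,y) \\
\operatorname{cov}(x,y) &= \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}
            = \frac{755}{5} - 15 \cdot 8 = 31 \\
\sigma_z^2 &= 50 + 25.2 + 2 \cdot 31 = 137.2, \qquad \sigma_z = \sqrt{137.2} \approx 11.7132
\end{align*}
```

The only term that cannot be recovered from the two sets' individual statistics is $\sum x_i y_i$, which is why one extra running sum is needed.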

#include <stdio.h>
#include <math.h>

struct stat2 {
  long sumx;
  long sumy;
  long long sumxx;
  long long sumxy;
  long long sumyy;
  size_t count;
};

void stat2_add(struct stat2 *stat, long x, long y) {
  stat->sumx += x;
  stat->sumxx += 1LL * x * x;
  stat->sumy += y;
  stat->sumyy += 1LL * y * y;

  // This is the only extra recurring work needed to meet OP's goal
  stat->sumxy += 1LL * x * y;

  stat->count++;
}

double stat2_avg(const struct stat2 *stat, int index) {
  switch (index) {
    case 'x':
      return 1.0 * stat->sumx / stat->count;
    case 'y':
      return 1.0 * stat->sumy / stat->count;
    default:
      return 1.0 * (stat->sumx + stat->sumy) / stat->count;
  }
}

double stat2_std(const struct stat2 *stat, int index) {
  double offset = 0.0;  // 0.0 for population STD, 1.0 for sample STD
  double var;
  switch (index) {
    case 'x':
      var = (stat->sumxx - 1.0 * stat->sumx * stat->sumx / stat->count)
          / (stat->count - offset);
      break;
    case 'y':
      var = (stat->sumyy - 1.0 * stat->sumy * stat->sumy / stat->count)
          / (stat->count - offset);
      break;
    default: {
      // SUM(x+y) = SUM(x) + SUM(y)
      double z = stat->sumx + stat->sumy;
      // SUM((x+y)*(x+y)) = SUM(x*x) + 2*SUM(x*y) + SUM(y*y)
      double zz = stat->sumxx + 2LL * stat->sumxy + stat->sumyy;
      var = (zz - 1.0 * z * z / stat->count) / (stat->count - offset);
    }
  }
  return sqrt(var);
}

void stat2_report(const struct stat2 *stat, const char *title) {
  printf("%s\n", title);
  printf("  x   Avg:%9f  STD:%f\n", stat2_avg(stat, 'x'), stat2_std(stat, 'x'));
  printf("  y   Avg:%9f  STD:%f\n", stat2_avg(stat, 'y'), stat2_std(stat, 'y'));
  printf("  x+y Avg:%9f  STD:%f\n", stat2_avg(stat, 'z'), stat2_std(stat, 'z'));
}

int main(void) {
  size_t i;
  struct stat2 A = { 0 };  // zero-initializes every member, including count
  int dataA[] = { 5, 10, 15, 20, 25 };
  int dataB[] = { 2, 4, 11, 7, 16 };
  for (i = 0; i < 5; i++)
    stat2_add(&A, dataA[i], dataB[i]);
  stat2_report(&A, "A");
  return 0;
}

Output

A
  x   Avg:15.000000  STD:7.071068
  y   Avg: 8.000000  STD:5.019960
  x+y Avg:23.000000  STD:11.713240
chux - Reinstate Monica
  • Thanks a lot :) But there is a small problem. The mean (average) of A+B is 23 and not 11,5: (5+2 + 10+4 + 15+11 + 20+7 + 25+16)/5. And the resulting STD (population) is 11,71324037. – guga Dec 17 '15 at 16:50
  • @guga I had a different understanding of what you wanted by "adding sets". To me it was a set of 5 samples and then another set of 5 samples. I now see your interest more clearly and will need to amend my answer. – chux - Reinstate Monica Dec 17 '15 at 16:55
  • @guga repaired. This assumes the data points of both sets are available at the same time. – chux - Reinstate Monica Dec 17 '15 at 17:46
  • Many thanks, the results are correct. But can it be done without the extra work in the loop (stat2_add)? I mean, using a formula. Assume that you already have the STD data (variance, mean, etc.) from dataA and dataB and want to calculate new values based on the results you found. For example, from DataA, Mean = 15, and from DataB, Mean = 8. So, summing both (without the extra loop, using only simple maths) you have DataC, Mean = 23 (15+8). Is it possible to calculate the variance in the same way, once you already have the variances (and STDs) of DataA and DataB? – guga Dec 17 '15 at 18:48
  • I'm asking this because of the properties of the STD per se. For example, if you already have the STD of a data set and each element is multiplied by a factor of 3 (a fixed value), then you get the new statistics with the following formulas: NewMean = Factor x OldMean; NewSquaredMean = Factor x Factor x OldSquaredMean; NewVariance = Factor x Factor x OldVariance; NewSTD = Factor x OldSTD; NewMax = Factor x OldMax; NewMin = Factor x OldMin – guga Dec 17 '15 at 18:52
  • @guga In general, no. I am confident you need the variance of AB. Yet if data sets 1 and 2 have a strong correlation, then I think the answer is yes (of course that implies a negligible variance). If sets 1 and 2 have a fixed relationship (1.0 correlation known ahead of time), then yes. If you can tolerate some error, I think a reasonable estimate can be made. You cannot get something for nothing. – chux - Reinstate Monica Dec 17 '15 at 19:01
  • @OP I suspect `stat->sumxy += 1LL * x * y;` is the way to go and optimize other attributes of the (unposted) code. – chux - Reinstate Monica Dec 17 '15 at 19:03
  • @guga For example, the 5 pairs of data have a 0.87 correlation with `y=0.62*x - 1.3`. If we have _some_ linear estimate of the 2 arrays, we could get `STD:11.63` rather than the correct `11.71` **without** that "extra" `stat->sumxy += 1LL * x * y;` – chux - Reinstate Monica Dec 17 '15 at 19:16
  • Thanks for the answer. OK, so to achieve the correct results there must be a fixed correlation between the data. But I wonder if there is a formula for when there is no correlation (based on the sum of their squares, perhaps?) to find the variance without having to compute another "set" or adding some extra variable for each element. That will take extra computation time. I succeeded in optimizing the algo even while being forced to add the extra code, but considering that the function generally performs around 9 million loops before finding good data, this extra code = slow! – guga Dec 19 '15 at 02:47
  • @guga Based upon "no correlation" or "unknown correlation"? Hmmm. – chux - Reinstate Monica Dec 19 '15 at 03:30
  • My mistake. Unknown correlation. :) – guga Dec 19 '15 at 13:06