Recomputing the weighted mean (which is already available in another column) inside the calculation of the weighted sum of squared deviations results in the error "Cannot perform an aggregate function on an expression containing an aggregate or a subquery.":

SQRT(SUM(wt.value*SQUARE(out1.value-(SUM(ISNULL(out1.value,0)*wt.value)/SUM(wt.value))))
    / (((COUNT(wt.value)-1)*SUM(wt.value))/(COUNT(wt.value)))) as wsdevInc0

Can you suggest how to calculate the WEIGHTED STANDARD DEVIATION in SQL, short of adding another level of 'SELECT' to the query hierarchy (and probably duplicating the selection of data)?

user1444275
1 Answer


I just encountered this same problem and error message. The way to solve it is to rewrite the weighted standard deviation formula.

The population variance formula, using <>'s to denote an average, which also applies to a weighted average, is:

variance = <(x - <x>)^2>

This form contains an aggregate function embedded in another aggregate function, hence the SQL error message.

The solution is to rewrite the formula without nested aggregations:

variance = <x^2> - <x>^2

This follows directly by multiplying out the previous formula and noting that <x<x>> = <x><x>.
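Spelled out for the weighted case, in the same notation as the SQL below (w is the weight, x the value, and m = SUM(w*x)/SUM(w) the weighted mean), the expansion is:

variance = SUM(w*(x - m)^2)/SUM(w)
         = SUM(w*x^2)/SUM(w) - 2*m*SUM(w*x)/SUM(w) + m^2
         = SUM(w*x^2)/SUM(w) - m^2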

The standard deviation is just the square root of the variance, so the SQL becomes:

SQRT( SUM(w*SQUARE(x))/SUM(w) - SQUARE(SUM(w*x)/SUM(w)) )

where w is the weight, and x is the statistical variable.
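As a concrete sketch using the aliases from the question (out1.value as the value, wt.value as the weight; the join condition and the ISNULL handling are assumptions to adapt to your own schema), the population weighted standard deviation fits in a single, non-nested aggregate SELECT:

SELECT SQRT( SUM(wt.value*SQUARE(ISNULL(out1.value,0)))/SUM(wt.value)
           - SQUARE(SUM(wt.value*ISNULL(out1.value,0))/SUM(wt.value)) ) AS wsdevInc0
FROM out1
JOIN wt ON wt.id = out1.id   -- hypothetical join condition

Because no aggregate appears inside another aggregate, this avoids the "Cannot perform an aggregate function..." error.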

The above formulas are applicable to a population data set. For a sample data set, multiply the above SQL value by

SQRT( COUNT(*)/(COUNT(*)-1) )
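In T-SQL, watch out for integer division in that factor: COUNT(*)/(COUNT(*)-1) on its own truncates (e.g. 5/4 evaluates to 1), so use a decimal literal or a CAST. A sketch, folding the correction into the variance before taking the square root (equivalent to multiplying the standard deviation by the factor above):

SQRT( ( SUM(w*SQUARE(x))/SUM(w) - SQUARE(SUM(w*x)/SUM(w)) )
      * COUNT(*)/(COUNT(*)-1.0) ) as wsdev_sample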
Matt
  • For weighted values, shouldn't the above formulas be multiplied by SQRT(SUM(w)/(SUM(w)-1)) instead of SQRT(COUNT(*)/(COUNT(*)-1))? – decal Jul 19 '16 at 20:05
  • @decal No I don't think so. The `N-1` that shows up in sample data formulas is counting degrees of freedom, i.e. the number of data points. Additionally, if `SUM(w)=1` then your formula gives `SQRT(1/0)`. – Matt Dec 20 '19 at 15:05
  • I believe @decal to be correct. If we treat the weights as counts of occurrences of an event, then in a sense the number of data points is the sum of the weights. – Chris Aug 22 '20 at 01:38
  • I confirmed in R that decal is correct using the Hmisc wtd.var function. – Chris Aug 22 '20 at 01:47
  • @Chris It sounds like you are agreeing with me. If you define the weights as the count, then you get the same formula in my answer above. However, if you define weights differently, for example if `SUM(w) = 1` then you end up with `SQRT(1/0)`. – Matt Aug 22 '20 at 14:12
  • The estimated standard deviation of one sample is undefined. – Chris Aug 23 '20 at 01:25
  • @Chris Let's take a specific example: 5 sample points (10,20,30,40,50) with equal weights (0.2, 0.2, 0.2, 0.2, 0.2). Can you use this data to demonstrate your above claim that "I believe decal to be correct"? – Matt Aug 23 '20 at 14:08
  • As stated before, in the context of treating the weights as a count of events... In that case a weight of 0.2 doesn't make sense. I was and am only commenting on that scenario, not on other weight usage. In the case of 0.2 weights, I hold no opinion on that matter. – Chris Aug 24 '20 at 02:47
  • @Chris Exactly, so for arbitrary weights, the correct multiplicative factor is `SQRT( COUNT(*)/(COUNT(*)-1) )`, not `SQRT(SUM(w)/(SUM(w)-1))`. You are in fact agreeing with my above answer, not decal's comment. – Matt Aug 24 '20 at 12:17
  • No, I 100% agree with decal's comment for weights that indicate counts. For arbitrary weights, I have no idea. The author of Hmisc suggests to normalize the weights to sum to the length of the data thus avoiding the 0.2 situation. This is my final comment on this matter. – Chris Aug 25 '20 at 18:10
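For reference, the two correction factors debated in the comments, written as SQL expressions (which one applies depends on how the weights are defined; this only restates the two positions, it does not settle them):

SQRT( COUNT(*)/(COUNT(*)-1.0) )   -- one row = one observation, as in the answer above
SQRT( SUM(w)/(SUM(w)-1.0) )       -- weights treated as frequency counts, as decal suggests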