
Standard deviation analysis can be a useful way to find outliers. Is there a way to incorporate the result of this query (the value four standard deviations above the mean)...

SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as high FROM [publicdata:samples.natality];

result = 12.721342001626912

...into another query that shows which states and dates have the most babies born heavier than 4 standard deviations above average?

SELECT state, year, month, COUNT(*) AS outlier_count
FROM [publicdata:samples.natality]
WHERE
  (weight_pounds > 12.721342001626912)
AND
  (state != '' AND state IS NOT NULL)
GROUP BY state, year, month
ORDER BY outlier_count DESC;

Result:

Row  state  year  month  outlier_count
1    MD     1990  12     22
2    NY     1989  10     17
3    CA     1991  9      14

Essentially, it would be great to combine these two steps into a single query.

Michael Manoochehri

1 Answer


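You can abuse JOIN for this, at some cost in performance. Give every row of the table and the single-row threshold subquery the same constant key, then join on that key so each row can be compared against the threshold: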

SELECT n.state, n.year, n.month, COUNT(*) AS outlier_count
FROM (
  SELECT state, year, month, weight_pounds, 1 AS key
  FROM [publicdata:samples.natality]) AS n
JOIN (
  SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) AS giant_baby,
         1 AS key
  FROM [publicdata:samples.natality]) AS o
ON n.key = o.key
WHERE
  (n.weight_pounds > o.giant_baby)
AND
  (n.state != '' AND n.state IS NOT NULL)
GROUP BY n.state, n.year, n.month
ORDER BY outlier_count DESC;
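
To try the constant-key join without a BigQuery account, here is a minimal local sketch that runs the same shape of query against SQLite from Python. Everything outside the query pattern is an assumption made for illustration: the in-memory natality table holds made-up rows, the key column is renamed join_key, and SampleStddev is a hypothetical helper standing in for STDDEV (SQLite has no built-in one; the sample standard deviation is assumed here).

```python
import sqlite3
import statistics


class SampleStddev:
    """Hypothetical aggregate standing in for STDDEV, which SQLite lacks."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Skip NULLs, as SQL aggregates conventionally do.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Sample standard deviation; undefined for fewer than two values.
        return statistics.stdev(self.values) if len(self.values) > 1 else None


def build_toy_table():
    """Create an in-memory table of made-up rows (not real natality data)."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("STDDEV", 1, SampleStddev)
    conn.execute(
        "CREATE TABLE natality "
        "(state TEXT, year INTEGER, month INTEGER, weight_pounds REAL)"
    )
    rows = [("CA", 1991, 9, 7.0)] * 198   # typical weights
    rows += [("MD", 1990, 12, 14.0)] * 2  # two extreme weights
    rows += [("", 1990, 1, 14.0)]         # extreme, but no state recorded
    conn.executemany("INSERT INTO natality VALUES (?, ?, ?, ?)", rows)
    return conn


# Same pattern as the answer: both sides carry a constant key, so the join
# attaches the single threshold row to every row of the table.
QUERY = """
SELECT n.state, n.year, n.month, COUNT(*) AS outlier_count
FROM (
  SELECT state, year, month, weight_pounds, 1 AS join_key
  FROM natality) AS n
JOIN (
  SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) AS giant_baby,
         1 AS join_key
  FROM natality) AS o
ON n.join_key = o.join_key
WHERE n.weight_pounds > o.giant_baby
  AND n.state != '' AND n.state IS NOT NULL
GROUP BY n.state, n.year, n.month
ORDER BY outlier_count DESC
"""


def outliers_by_state_month(conn):
    """Return (state, year, month, outlier_count) rows, largest count first."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    print(outliers_by_state_month(build_toy_table()))
```

On the toy rows, only the two MD rows pass both filters; the row with an empty state is above the threshold but is dropped by the state check. This only illustrates the join pattern and says nothing about how BigQuery executes or prices the real query.
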
Jordan Tigani
  • I think this is right... but whether it's right or not, I'm giving +1 for the "giant_baby" alias which is still making me giggle as I type this. – mdahlman Sep 21 '12 at 18:04
  • Also, I think the BigQuery community needs to do more analysis to determine exactly why Maryland had so many giant babies in December 1990. – Michael Manoochehri Sep 21 '12 at 19:05