12

Is there some trick that would allow one to use bc (or some other standard utility) to return the standard deviation of an arbitrary number of numbers? For convenience, let's say that the numbers are stored in a Bash variable in the following way:

myNumbers="0.556
1.456
45.111
7.812
5.001"

So, the answer I'm looking for would be in a form such as the following:

standardDeviation="$(echo "${myNumbers}" | <insert magic here>)"
chepner
d3pd

5 Answers

15

Using awk:

standardDeviation=$(
    echo "$myNumbers" |
        awk '{sum+=$1; sumsq+=$1*$1}END{print sqrt(sumsq/NR - (sum/NR)**2)}'
)
echo $standardDeviation

Using perl:

#!/usr/bin/env perl

use strict; use warnings;
use Math::NumberCruncher;

my @data = qw/
    0.556
    1.456
    45.111
    7.812
    5.001
/;

print Math::NumberCruncher::StandardDeviation(\@data);

Output

16.7631
Gilles Quénot
  • That's a nice little awk cycle there and, naturally, one can't argue against the sensibility of using Perl here. Thanks for your help! – d3pd Feb 27 '13 at 12:22
  • 1
    I get 'syntax error at or near *' because of the `**2`, replacing it with `*(sum/NR)` fixes this. –  Mar 09 '16 at 19:04
  • Nice. A note that this is population standard deviation vs sample standard deviation... – dawg Aug 01 '17 at 17:44
5

Population standard deviation:

jq -s '(add/length)as$a|map(pow(.-$a;2))|add/length|sqrt'
ruby -e'a=readlines.map(&:to_f);puts (a.map{|x|(x-a.reduce(:+)/a.length)**2}.reduce(:+)/a.length)**0.5'
jq -s '(map(.*.)|add/length)-pow(add/length;2)|sqrt'
awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}'

In awk, ^ is in POSIX but ** is not. ** is supported by gawk and nawk but not by mawk.
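For reference, the single-pass awk and jq commands above rely on a standard identity for the population standard deviation. In the awk command, `x` holds the running sum of the values, `y` the running sum of their squares, and `NR` plays the role of N (assuming every input line is a number):

```latex
\sigma
  = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2}
  = \sqrt{\frac{1}{N}\sum_{i=1}^{N}x_i^2-\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)^2},
\qquad
\mu=\frac{1}{N}\sum_{i=1}^{N}x_i
```

The first form is what the first jq and ruby commands compute directly; the second form is what the second jq command and the awk command compute.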

Sample standard deviation (the first two commands match the first two commands above, except that the final division is by length-1 instead of length):

jq -s '(add/length)as$a|map(pow(.-$a;2))|add/(length-1)|sqrt'
ruby -e'a=readlines.map(&:to_f);puts (a.map{|x|(x-a.reduce(:+)/a.length)**2}.reduce(:+)/(a.length-1))**0.5'
R -q -e 'sd(scan("stdin"))'
nisetama
4

Or use GNU Octave (which can do much more than a simple std):

standardDeviation="$(echo "${myNumbers}" | octave --eval 'disp(std(scanf("%f")))')"
echo $standardDeviation

Outputs

18.742
Andy
  • At least on my system, the `octave` command briefly launches a graphical application named octave-gui unless I add the `--no-window-system` flag. You can replace `std(scanf("%f"))` with `std(scanf("%f"),1)` to calculate the population standard deviation instead of the sample standard deviation. – nisetama Mar 20 '19 at 05:58
  • 1
    @nisetama the GUI is default since GNU Octave 4.0.x and was changed back for Octave 5.0.x – Andy Mar 20 '19 at 06:28
1

Given:

$ myNumbers=$(echo "0.556 1.456 45.111 7.812 5.001" | tr " " "\n")

First decide whether you need the sample standard deviation or the population standard deviation of those numbers.

Population standard deviation (the function STDEV.P in Excel) requires the entire population of data. In Excel, text or blanks are skipped.

It is easily calculated on a rolling basis in awk:

$ echo "$myNumbers" | awk '$1+0==$1 {sum+=$1; sumsq+=$1*$1; cnt++}
                           END{print sqrt(sumsq/cnt - (sum/cnt)**2)}'
16.7631

Or in Ruby:

$ echo "$myNumbers" | ruby -e 'arr=$<.read.split(/\s/).map { |e| Float(e) rescue nil }.compact
                             sumsq=arr.inject(0) { |acc, e| acc+=e*e }
                             p (sumsq/arr.length - (arr.sum/arr.length)**2)**0.5'
16.76307799182477
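One caveat with the running sum-of-squares form: it subtracts two numbers that can be large and nearly equal, which may lose precision when the mean is large compared with the spread. A commonly used alternative for a rolling calculation is Welford's update. The following is a minimal sketch in POSIX awk, not a drop-in replacement; the variable names `populationSd`, `d`, `mean` and `m2` are chosen for illustration only.

```shell
myNumbers="0.556
1.456
45.111
7.812
5.001"

# Welford's update: keep a running mean and a running sum of squared
# deviations (m2), so no large sums of squares are ever subtracted.
populationSd=$(printf '%s\n' "$myNumbers" |
    awk '$1+0==$1 { n++; d = $1 - mean; mean += d / n; m2 += d * ($1 - mean) }
         END { if (n > 0) printf "%.4f\n", sqrt(m2 / n) }')
echo "$populationSd"
```

For the sample data this gives the same value as the commands above to four decimal places. Dividing `m2` by `n - 1` instead of `n` would give the sample standard deviation.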

For a sample standard deviation (the function STDEV.S in Excel and ignoring text or blanks), you need to have the entire sample collected first, since the mean is subtracted from each value in the sample.

In awk:

$ echo "$myNumbers" | 
     awk 'function sdev(array) {
     for (i=1; i in array; i++)
        sum+=array[i]
     cnt=i-1
     mean=sum/cnt
     for (i=1; i in array; i++)  
        sqdif+=(array[i]-mean)**2
     return (sqdif/(cnt-1))**0.5
     }
     $1+0==$1 {sum1[++cnt]=$1} 
     END {print sdev(sum1)}' 
18.7417

Or in Ruby:

$ ruby -lane 'BEGIN{col1=[]}
            col1 << Float($F[0]) rescue nil
            END {col1.compact
                 mean=col1.sum / col1.length
                 p (col1.inject(0){ |acc, e| acc+(e-mean)**2 } / 
                        (col1.length-1))**0.5
              }' <(echo "$myNumbers")
18.741690950925424
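As an aside, if storing the whole sample is inconvenient, the sample standard deviation can also be obtained from the same running sums used for the population case, by dividing the corrected sum of squares by n-1. This is a sketch under the assumption that the sum-of-squares shortcut is accurate enough for the data at hand; `sampleSd` and the awk variable names are illustrative.

```shell
myNumbers="0.556
1.456
45.111
7.812
5.001"

# One pass: count the numeric lines and accumulate their sum and sum of squares.
# The n > 1 guard avoids dividing by zero for a single value.
sampleSd=$(printf '%s\n' "$myNumbers" |
    awk '$1+0==$1 { n++; sum += $1; sumsq += $1 * $1 }
         END { if (n > 1) printf "%.4f\n", sqrt((sumsq - sum * sum / n) / (n - 1)) }')
echo "$sampleSd"
```

This uses only `^`-free POSIX awk arithmetic, so it should also run under mawk, and for the sample data it agrees with the two-pass results above to four decimal places.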
dawg
1

Just for fun, 8 years later, with gnuplot:

echo "${myNumbers}" | gnuplot -e 'stats "-" nooutput; print STATS_stddev'
16.7630779918248

By way of explanation, I am getting gnuplot to run the stats function on the data on its stdin, suppressing the normal output and printing just the standard deviation.


Related, but not really part of the answer: you can also generate lots of other statistics, such as the median, kurtosis, skewness, quartiles, maximum and minimum, like this:

echo "${myNumbers}" | gnuplot -e 'stats "-"'

Sample Output

* FILE: 
  Records:           5
  Out of range:      0
  Invalid:           0
  Header records:    0
  Blank:             0
  Data Blocks:       1

* COLUMN: 
  Mean:              11.9872
  Std Dev:           16.7631
  Sample StdDev:     18.7417
  Skewness:           1.4125
  Kurtosis:           3.1303
  Avg Dev:           13.2495
  Sum:               59.9360
  Sum Sq.:         2123.4687

  Mean Err.:          7.4967
  Std Dev Err.:       5.3010
  Skewness Err.:      1.0954
  Kurtosis Err.:      2.1909

  Minimum:            0.5560 [0]
  Maximum:           45.1110 [2]
  Quartile:           1.4560 
  Median:             5.0010 
  Quartile:           7.8120 
Mark Setchell