What's the best way to calculate min/avg/max/std-dev for some random data in shell?

What if one has several columns per line, and needs to calculate the statistics for each one?

Sample input (based on processing of the hping output), with the columns 3, 4 and 5 being of interest:

0       145.5   146 = 75 + 71
1       142.7   142 = 72 + 70
2       140.7   140 = 70 + 70
3       146.7   146 = 76 + 70
4       148.3   148 = 77 + 71
5       157.5   157 = 87 + 70
6       167.1   167 = 96 + 71
7       166.3   166 = 95 + 71
8       167.7   167 = 97 + 70
9       159.0   159 = 88 + 71
10      156.7   156 = 86 + 70
11      154.9   155 = 84 + 71
12      151.9   152 = 81 + 71
13      157.3   157 = 86 + 71
14      155.0   155 = 84 + 71
15      157.7   158 = 87 + 71
16      156.6   156 = 86 + 70

(Note that this input is a live stream ad infinitum.)

cnst
  • I would say Perl script. Then again, I use Perl scripts for everything I do not need C++ for performance... – DeVadder Nov 25 '13 at 08:28
  • Are you saying you want running statistics for a continuous stream? That will become rather useless after it has been running for a month or two as it will barely change from day to day. – Borodin Nov 25 '13 at 08:59
  • @Borodin, how do you use `ping(8)`? This is the same thing. What I mean is that there probably has to be an interrupt handler for `^C`, after which the summary is printed. – cnst Nov 25 '13 at 18:25

1 Answer

I suggest you use Perl and keep a running total of N, Σx, and Σx², as well as the minimum and maximum x values. All of the values you need can be derived from those.
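For reference, here is how the mean and standard deviation follow from those running totals. This is the population form (dividing by N rather than N − 1), which matches the `sqrt` expression in the code; the symbols x̄ and σ are notation introduced here, not taken from the original post.

```latex
\[
\bar{x} = \frac{\sum x}{N},
\qquad
\sigma^2 = \frac{1}{N}\sum (x - \bar{x})^2
         = \frac{\sum x^2}{N} - 2\bar{x}\,\frac{\sum x}{N} + \bar{x}^2
         = \frac{\sum x^2}{N} - \bar{x}^2,
\qquad
\sigma = \sqrt{\frac{\sum x^2}{N} - \bar{x}^2}
\]
```

Because only N, Σx, and Σx² are stored, memory use stays constant no matter how long the stream runs.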

The example below demonstrates this approach. It prints the current statistics after each line of input is read.

use strict;
use warnings;

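# Running totals, one slot per column of interest.
my ($n, @sum, @sumsq, @min, @max);

while (<DATA>) {

  # Pull out every numeric field; the "=" and "+" tokens are skipped.
  my @columns = /[0-9.]+/g;

  my (@mean, @std_dev);
  ++$n;
  for my $i (0 .. 2) {
    # The third, fourth and fifth numeric fields are the ones of interest.
    my $x = $columns[$i + 2];

    $sum[$i] += $x;
    $sumsq[$i] += $x * $x;

    $mean[$i] = $sum[$i] / $n;

    # Population variance; clamp at zero because rounding error can push it
    # slightly negative, which would make sqrt die.
    my $var = $sumsq[$i] / $n - $mean[$i] * $mean[$i];
    $var = 0 if $var < 0;
    $std_dev[$i] = sqrt($var);

    $min[$i] = $x unless defined $min[$i] and $min[$i] <= $x;
    $max[$i] = $x unless defined $max[$i] and $max[$i] >= $x;
  }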

  print "min     = @min\n";
  print "max     = @max\n";
  print "mean    = @mean\n";
  print "std_dev = @std_dev\n";
  print "---\n";
}

__DATA__
0       145.5   146 = 75 + 71
1       142.7   142 = 72 + 70
2       140.7   140 = 70 + 70
3       146.7   146 = 76 + 70
4       148.3   148 = 77 + 71
5       157.5   157 = 87 + 70
6       167.1   167 = 96 + 71
7       166.3   166 = 95 + 71
8       167.7   167 = 97 + 70
9       159.0   159 = 88 + 71
10      156.7   156 = 86 + 70
11      154.9   155 = 84 + 71
12      151.9   152 = 81 + 71
13      157.3   157 = 86 + 71
14      155.0   155 = 84 + 71
15      157.7   158 = 87 + 71
16      156.6   156 = 86 + 70

Output:

min     = 146 75 71
max     = 146 75 71
mean    = 146 75 71
std_dev = 0 0 0
---
min     = 142 72 70
max     = 146 75 71
mean    = 144 73.5 70.5
std_dev = 2 1.5 0.5
---
min     = 140 70 70
max     = 146 75 71
mean    = 142.666666666667 72.3333333333333 70.3333333333333
std_dev = 2.4944382578501 2.05480466765642 0.47140452079146
---
min     = 140 70 70
max     = 146 76 71
mean    = 143.5 73.25 70.25
std_dev = 2.59807621135332 2.38484800354236 0.433012701892219
---
min     = 140 70 70
max     = 148 77 71
mean    = 144.4 74 70.4
std_dev = 2.93938769133971 2.60768096208109 0.489897948555485
---
min     = 140 70 70
max     = 157 87 71
mean    = 146.5 76.1666666666667 70.3333333333333
std_dev = 5.40832691319598 5.39804491356711 0.47140452079146
---
min     = 140 70 70
max     = 167 96 71
mean    = 149.428571428571 79 70.4285714285714
std_dev = 8.74817765279739 8.55235974119756 0.494871659305337
---
min     = 140 70 70
max     = 167 96 71
mean    = 151.5 81 70.5
std_dev = 9.8488578017961 9.59166304662544 0.5
---
min     = 140 70 70
max     = 167 97 71
mean    = 153.222222222222 82.7777777777778 70.4444444444444
std_dev = 10.4857339888036 10.3470637571759 0.496903995000609
---
min     = 140 70 70
max     = 167 97 71
mean    = 153.8 83.3 70.5
std_dev = 10.0975244490914 9.94032192637645 0.5
---
min     = 140 70 70
max     = 167 97 71
mean    = 154 83.5454545454545 70.4545454545455
std_dev = 9.64836302648838 9.50945592902742 0.497929597732158
---
min     = 140 70 70
max     = 167 97 71
mean    = 154.083333333333 83.5833333333333 70.5
std_dev = 9.24173805202349 9.10547759440561 0.5
---
min     = 140 70 70
max     = 167 97 71
mean    = 153.923076923077 83.3846153846154 70.5384615384615
std_dev = 8.89651218141581 8.77530154238378 0.498518515262866
---
min     = 140 70 70
max     = 167 97 71
mean    = 154.142857142857 83.5714285714286 70.5714285714286
std_dev = 8.60943952761114 8.48287590817347 0.494871659305337
---
min     = 140 70 70
max     = 167 97 71
mean    = 154.2 83.6 70.6
std_dev = 8.32025640630559 8.19593395125498 0.489897948558269
---
min     = 140 70 70
max     = 167 97 71
mean    = 154.4375 83.8125 70.625
std_dev = 8.10839649684202 7.97824189593171 0.484122918275927
---
min     = 140 70 70
max     = 167 97 71
mean    = 154.529411764706 83.9411764705882 70.5882352941177
std_dev = 7.874886718579 7.75712642546343 0.492152956783766
---
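
Since the question describes the input as a live stream, here is a minimal sketch of a variant that reads from STDIN and prints a single summary when interrupted with `^C` or when the input ends. It is not part of the original answer. The sub names `update` and `summary`, the one-decimal formatting, and the min/mean/max/std_dev order are choices made for this illustration. It assumes each line carries at least five numeric fields, as in the sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Running totals per column of interest (the same quantities as above).
my ($n, @sum, @sumsq, @min, @max) = (0);

# Fold one input line into the running totals.
sub update {
    my ($line) = @_;
    my @fields = $line =~ /[0-9.]+/g;
    return if @fields < 5;              # assumption: skip short lines
    ++$n;
    for my $i (0 .. 2) {
        my $x = $fields[$i + 2];
        $sum[$i]   += $x;
        $sumsq[$i] += $x * $x;
        $min[$i] = $x if !defined $min[$i] or $x < $min[$i];
        $max[$i] = $x if !defined $max[$i] or $x > $max[$i];
    }
    return;
}

# Build the summary text with one decimal place per value.
sub summary {
    return "no data\n" unless $n;
    my (@mean, @std_dev);
    for my $i (0 .. 2) {
        $mean[$i] = $sum[$i] / $n;
        my $var = $sumsq[$i] / $n - $mean[$i] ** 2;
        $var = 0 if $var < 0;           # guard against rounding below zero
        $std_dev[$i] = sqrt $var;
    }
    my $fmt = sub { join ' ', map { sprintf '%.1f', $_ } @_ };
    return join '',
        "min     = ", $fmt->(@min),     "\n",
        "mean    = ", $fmt->(@mean),    "\n",
        "max     = ", $fmt->(@max),     "\n",
        "std_dev = ", $fmt->(@std_dev), "\n";
}

# On ^C, print the summary once and exit instead of dying silently.
$SIG{INT} = sub { print summary(); exit 0 };

while (defined(my $line = <STDIN>)) {
    update($line);
}

# Also print the summary if the input simply ends.
print summary();
```

Saved under a hypothetical name such as `running_stats.pl`, it would sit at the end of whatever pipeline produces the lines shown in the question. Note that `^C` at a terminal is normally delivered to every process in the foreground pipeline, so the producer may exit at the same moment.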
Borodin
  • @Zaid: Absolutely not. The statistics are calculated afresh for each line from the running totals and are always exact. – Borodin Nov 25 '13 at 11:50
  • @Borodin, nice start, thank you! I think the only thing missing is an interrupt handler, plus it'd be nice to only print one decimal point (and probably switch the order of max and mean). – cnst Nov 25 '13 at 18:31
  • @cnst: Your impudence is unbelievable. *“nice start”* indeed. You asked for help and I gave you help. Stack Overflow is *not* a place where you ask for free software to be written, and certainly not one where you send my feeble attempts back for correction. Write your own damned interrupt handler and take your insolence elsewhere. – Borodin Nov 26 '13 at 05:37