Compute the mean and std over a column with awk

Question

I have this file:

Took:  15.473214149475098  seconds
Took:  12.94953465461731  seconds
Took:  2.235722780227661  seconds
Took:  40.53083419799805  seconds
Took:  21.840606212615967  seconds
Took:  35.777870893478394  seconds
Took:  13.153780221939087  seconds
Took:  2.966165781021118  seconds
Took:  35.54965615272522  seconds

I would like to compute the mean and standard deviation of the times directly in the terminal. Can awk help? I am not very familiar with it. I tried to extract the column of numerical values with `cat <filename> | awk -F "Took:" {print$2}`, but it just returned the whole content of the file.
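
A likely explanation for the attempt above: the awk program is not quoted, so the shell expands `$2` itself (normally to an empty string) and awk receives `{print}`, which prints every line unchanged. No custom separator is needed either, since default whitespace splitting already puts the number in field 2. A minimal sketch, with a hypothetical helper name `extract_times` and two made-up sample rows:

```shell
# Single quotes keep the shell from touching $2, so awk sees the program intact.
# With default whitespace splitting, field 1 is "Took:" and field 2 is the number.
extract_times() {
    awk '{ print $2 }' "$@"
}

# Two made-up rows in the same layout as the file in the question.
printf 'Took:  15.5  seconds\nTook:  12.5  seconds\n' | extract_times
```

The helper also accepts a file name, for example `extract_times yourfile`.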

and provide the algorithm you want to use to calculate the values you want output. — Ed Morton, Dec 14 '18 at 14:28

RavinderSingh13 · Answer 1 · 2018-12-14T10:06:38.650

3

Could you please try the following to get the mean of the 2nd column?

awk '{sum+=$2;if($2){count++}} END{print sum/count}'  Input_file

EDIT:

awk '{if($2!=""){count++;sum+=$2;y+=$2^2}} END{mean=sum/count;var=y/count-mean^2;sq=var>0?sqrt(var):0;print "Mean = " mean ORS "S.D = " sq}'  Input_file

edited Dec 14 '18 at 10:06

answered Dec 14 '18 at 08:58

RavinderSingh13


is it possible to print both mean and std with the same command line ? – dada Dec 14 '18 at 09:16

`if($2)` You don't want zeros messing with your results? ;D – James Brown Dec 14 '18 at 09:28

Try with: `echo -e 1\\n0\\n2 | awk ...` – James Brown Dec 14 '18 at 09:43
@RavinderSingh13 I downvoted because your script produces wrong results if there is a zero in column 2. – oguz ismail Dec 14 '18 at 09:46
Since you have `if($2)` it will not count zero values, so for 0,1,2 it prints 1.5 instead of the correct average, 1. – James Brown Dec 14 '18 at 09:46

@RavinderSingh13 I think we should, that's how you produce reliable code. – oguz ismail Dec 14 '18 at 09:49

@RavinderSingh13 The average of 0,1,2 is 1; your script gives 1.5 because it does not process zeroes due to `if($2)`. Replace it with `if($2!="")` or something. – James Brown Dec 14 '18 at 10:03

It outputs one, so I'll give you one. ;D – James Brown Dec 14 '18 at 10:08
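
The difference discussed in these comments can be reproduced with a three-row input whose second column is 0, 1 and 2. This is only a sketch; the function names and the sample rows are made up for illustration:

```shell
# Hypothetical input: three rows whose second column is 0, 1 and 2.
zeros_input() {
    printf 'Took: 0 seconds\nTook: 1 seconds\nTook: 2 seconds\n'
}

# Truthiness test: a field equal to 0 is false, so that row is not counted.
mean_truthy() {
    awk '{ sum += $2; if ($2) count++ } END { print sum / count }'
}

# Emptiness test: only rows with a missing second field are skipped.
mean_nonempty() {
    awk '{ if ($2 != "") { sum += $2; count++ } } END { print sum / count }'
}

zeros_input | mean_truthy     # prints 1.5 (sum 3 divided by 2 counted rows)
zeros_input | mean_nonempty   # prints 1   (sum 3 divided by 3 counted rows)
```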

gboffi · Answer 2 · 2018-12-14T21:28:08.627

3

The Wikipedia page on Standard deviation has an interesting section, "Rapid calculation methods". Of particular interest is Welford's algorithm, which is simple and numerically stable:

A_0, Q_0 = 0, 0
for k in (1, ...):
    j = k-1
    A_k = A_j + (X_k-A_j)/k
    Q_k = Q_j + (X_k-A_j)*(X_k-A_k)

where, at every step, A_k is equal to the running mean and Q_k is related to the population variance σ² by the relation Q_k = σ²*k.

With this theoretical background, we can write

$ awk 'BEGIN{a=0;q=0}{x=$2;b=a+(x-a)/NR;q+=(x-a)*(x-b);a=b}END{print a,sqrt(q/NR)}' file
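
For readability, the same recurrence can be spelled out as a multi-line program. The sketch below is a variant under stated assumptions rather than the one-liner itself: it keeps its own counter `n` instead of `NR` so that rows without a second field are skipped, and the helper names `timings` and `welford` are hypothetical. It is fed the nine timings from the question:

```shell
# The nine timings from the question, in the original "Took: <t> seconds" layout.
timings() {
    cat <<'EOF'
Took:  15.473214149475098  seconds
Took:  12.94953465461731  seconds
Took:  2.235722780227661  seconds
Took:  40.53083419799805  seconds
Took:  21.840606212615967  seconds
Took:  35.777870893478394  seconds
Took:  13.153780221939087  seconds
Took:  2.966165781021118  seconds
Took:  35.54965615272522  seconds
EOF
}

# Welford update with an explicit counter n (an assumption of this sketch),
# so rows lacking a second field do not inflate the divisor.
welford() {
    awk '
    $2 != "" {
        n++
        delta = $2 - mean          # X_k - A_(k-1)
        mean += delta / n          # A_k
        q += delta * ($2 - mean)   # Q_k = Q_(k-1) + (X_k - A_(k-1)) * (X_k - A_k)
    }
    END {
        if (n > 0) print mean, sqrt(q / n)   # running mean, population std
    }'
}

timings | welford
```

On this data it should print `20.053 13.4924`, matching the values reported in the other answers (awk's default output format keeps six significant digits).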

edited Dec 14 '18 at 21:28

answered Dec 14 '18 at 10:12

gboffi


`BEGIN{a=0;q=0}` is not strictly necessary because in Awk numerical variables are automagically initialized to 0, but I wanted to mimic the published algorithm as closely as possible. In other words, the one-liner `awk '{x=$2;b=a+(x-a)/NR;q+=(x-a)*(x-b);a=b}END{print a,sqrt(q/NR)}'` is equivalent to the one in the answer. – gboffi Dec 14 '18 at 10:42
last link is broken. – karakfa Dec 14 '18 at 14:04
@karakfa changed the link, thanks a lot for the heads up – gboffi Dec 14 '18 at 21:29

score 2 · Answer 3 · answered Dec 14 '18 at 13:58

2

Another quick way:

$ awk '{s+=$2; ss+=$2^2} END{print m=s/NR, sqrt(ss/NR-m^2)}' file

20.053 13.4924
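
One caveat, offered as a sketch rather than a criticism of the one-liner: `ss/NR-m^2` subtracts two nearly equal numbers when the values are large compared to their spread, and floating-point rounding can then swamp the result. For the timings in the question this is not a concern. The made-up input below (offset 1e9, spread of 1) shows the effect next to a two-pass computation; the function names are hypothetical, and the `v > 0` guard is added here only to avoid taking the square root of a negative number:

```shell
# Hypothetical data: a large offset with a tiny spread.
big_offset() {
    printf 'Took: 1000000001 seconds\nTook: 1000000002 seconds\nTook: 1000000003 seconds\n'
}

# One pass: mean of squares minus square of mean.
one_pass_std() {
    awk '{ s += $2; ss += $2^2 }
         END { m = s / NR; v = ss / NR - m^2; print (v > 0 ? sqrt(v) : 0) }'
}

# Two passes over stored values: mean first, then squared deviations.
two_pass_std() {
    awk '{ x[NR] = $2; s += $2 }
         END { m = s / NR
               for (i = 1; i <= NR; i++) d += (x[i] - m)^2
               print sqrt(d / NR) }'
}

big_offset | one_pass_std   # not reliable for this input
big_offset | two_pass_std   # prints 0.816497, i.e. sqrt(2/3)
```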

answered Dec 14 '18 at 13:58

karakfa


`print m=1, m*2` — never seen that, it's a nice trick, isn't it? – gboffi Dec 16 '18 at 01:10
assignments carry the value, that's why you can write `a=b=1` as well. – karakfa Dec 16 '18 at 01:11
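
A tiny sketch of what these two comments describe: an assignment is itself an expression, so it can be chained or printed and then reused. The second `print` relies on its arguments being evaluated left to right, which is how common awk implementations behave (the wrapper name `assign_demo` is hypothetical):

```shell
assign_demo() {
    awk 'BEGIN {
        a = b = 1            # b gets 1, then a gets the value of (b = 1)
        print a, b           # prints: 1 1
        print m = 3, m * 2   # prints: 3 6
    }'
}

assign_demo
```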

oguz ismail · Answer 4 · 2018-12-14T09:23:08.080

1

$ cat tst.awk
{ numbers[NR] = $2; sum += $2 }
END {
    mean = sum / length(numbers)
    # calculate std deviation
    for (i in numbers) {
        dif = numbers[i] - mean
        std += dif ^ 2
    }
    std = sqrt(std / length(numbers))

    print "Mean: " mean
    print "Standard Deviation: " std
}
$
$ awk -f tst.awk file
Mean: 20.053
Standard Deviation: 13.4924

edited Dec 14 '18 at 09:23

answered Dec 14 '18 at 09:20

oguz ismail


score 1 · Answer 5 · answered Dec 14 '18 at 12:01

Using a Perl one-liner:

> cat dada.txt 
Took:  15.473214149475098  seconds
Took:  12.94953465461731  seconds
Took:  2.235722780227661  seconds
Took:  40.53083419799805  seconds
Took:  21.840606212615967  seconds
Took:  35.777870893478394  seconds
Took:  13.153780221939087  seconds
Took:  2.966165781021118  seconds
Took:  35.54965615272522  seconds
> perl -lane '$s+=$F[1];push(@a,$F[1]); END { $m=$s/@a; $sd+=($_-$m)**2 for(@a);$sd=sqrt($sd/@a); print "Mean:$m\nStandard Deviation:$sd"} ' dada.txt
Mean:20.0530427826775
Standard Deviation:13.4923983082523
>
