Mean and Confidence Interval to plot

Question

I have a data file with tab-separated columns which looks like this:

6   27  4   12  20  100 50000   false   0.1 "DFSA"  2   201414  31.47408
3   27  4   12  20  100 50000   false   0.1 "DFSA"  2   204236  31.91436
4   27  4   12  20  100 50000   false   0.1 "DFSA"  2   206964  32.09382
8   27  4   12  20  100 50000   false   0.1 "DFSA"  2   203379  31.88138
1   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207287  32.4096
2   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207115  32.61346
7   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207751  32.77364
5   27  4   12  20  100 50000   false   0.1 "DFSA"  2   209856  32.77856
13  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204257  32.08478
10  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207335  32.22742
9   27  4   12  20  100 50000   false   0.1 "DFSA"  2   200518  31.64624
11  27  4   12  20  100 50000   false   0.1 "DFSA"  2   202415  32.0591
16  27  4   12  20  100 50000   false   0.1 "DFSA"  2   201548  31.59604
12  27  4   12  20  100 50000   false   0.1 "DFSA"  2   213840  33.27478
15  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208926  32.37782
14  27  4   12  20  100 50000   false   0.1 "DFSA"  2   210572  32.81294
17  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204724  32.26324
20  27  4   12  20  100 50000   false   0.1 "DFSA"  2   199169  31.53494
18  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208179  32.73408
19  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204342  31.82608
21  27  4   12  20  100 50000   false   0.1 "DFSA"  2   205068  31.986
24  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207798  32.49448
22  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207499  32.60746
23  27  4   12  20  100 50000   false   0.1 "DFSA"  2   214065  33.17844
25  27  4   12  20  100 50000   false   0.1 "DFSA"  2   198386  31.4488
26  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208307  32.77412
28  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208669  32.77474
27  27  4   12  20  100 50000   false   0.1 "DFSA"  2   214324  33.16506
30  27  4   12  20  100 50000   false   0.1 "DFSA"  2   209924  32.77186
29  27  4   12  20  100 50000   false   0.1 "DFSA"  2   212959  33.3218
32  27  4   12  20  100 50000   false   0.2 "DFSA"  2   486241  64.3479
31  27  4   12  20  100 50000   false   0.2 "DFSA"  2   487487  64.65076
33  27  4   12  20  100 50000   false   0.2 "DFSA"  2   494703  65.06718
34  27  4   12  20  100 50000   false   0.2 "DFSA"  2   488164  64.77828
36  27  4   12  20  100 50000   false   0.2 "DFSA"  2   476513  63.3158
35  27  4   12  20  100 50000   false   0.2 "DFSA"  2   491005  65.02426
38  27  4   12  20  100 50000   false   0.2 "DFSA"  2   487454  64.44962
37  27  4   12  20  100 50000   false   0.2 "DFSA"  2   490494  65.06572
39  27  4   12  20  100 50000   false   0.2 "DFSA"  2   472081  63.31234
40  27  4   12  20  100 50000   false   0.2 "DFSA"  2   498294  65.02114

The data continues down growing the value in column 9 (30 rows with value 0.1, 30 rows with value 0.2, another 30 rows with 0.3 and so on). I would like to plot with yerrorlines in Gnuplot the mean value column 13 for each of the 30 rows with the same value of column 9. To make the plot, I should pass Gnuplot column 9 value, mean of column 13 of the 30 rows, and the min value and the max value of the confidence interval (e.g. 95%). In the case of 0.1, the row to pass to Gnuplot should be like the following:

0.1  36.08  35.83  36.33
0.2  .....  .....  .....
0.3  .....  .....  .....
...  .....  .....  .....

Assuming a 95% confidence interval is +0.25 and +0.25 around the mean of 36.08

Thank you for your help Regards

The goal is that you add some code of your own to show at least the research effort you made to solve this yourself. — Cyrus, Feb 01 '18 at 05:40
sorry but it is not clear how you compute your average... where do the `36.08` `35.83` `36.33` come from? — Allan, Feb 01 '18 at 05:58
@Allan thanks to comment. The value 36.08 is the mean of last column values where column 8 has de value 0.1. The 35.83 is the min value of the confidence interval and 36.33 de max value, if the confidence interval is 0.5. — user1993416, Feb 01 '18 at 08:43

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

Answer

awk 'BEGIN{sum=0;pat=0;cnt=0}{if(pat==$9) {sum=sum+$13;cnt++} else {if (cnt!=0) {printf "%s %2.2f %2.2f %2.2f\n" ,pat,(sum/cnt)*0.95,sum/cnt,sum/cnt*1.05} pat=$9;sum=0;cnt=0} }END{printf "%s %2.2f %2.2f %2.2f\n" ,pat,(sum/cnt)*0.95,sum/cnt,sum/cnt*1.05}' data_table.txt

ARGS:

1.pat - number in column 9

2.sum - contains sums of column 13 of all lines that pat is the same

3.cnt - contains the times that pat was red

Tear down

BEGIN{sum=0;pat=0;cnt=0} - set all variables
if(pat==$9) - check if pattern is the same if it the same form mean and increment ctn
else - if cnt!=0 (pat line was summed for mean) print the wanted value to screen
printf "%s %2.2f %2.2f %2.2f\n" ,pat,(sum/cnt)*0.95,sum/cnt,sum/cnt*1.05 %2.2f get floating point 00.00 structure
END print mean for last pat

Output (for file entered above)

 0.1 30.77 32.39 34.01
 0.2 61.29 64.52 67.75

thank you very much. The awk code is perfect. I only will need to change the formula for the min and max value of the confidence interval to compute using the t-student pdf. — user1993416, Feb 01 '18 at 08:47

score 3 · Accepted Answer · answered Feb 01 '18 at 18:44

I know the question has the tag awk, but since gnuplot was mentioned, may I suggest a gnuplot-only solution?

Gnuplot has a smooth unique function that would look for unique values (in your case column 9) and do a simple average of the y-values (in your case column 13). You can plot this directly, but with a few extra lines you can make your output much better.

Using the input file from the question, and the following code:

set term png
set out "test.png"

set table "tab.tmp"
plot "data.txt" u 9:13 smooth unique
unset table

set xrange[0.09:0.21]
set yrange[25:75]

set style fill transparent solid 0.2 noborder

set key top left
set multiplot layout 2,1

plot "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2*0.95):($2*1.05) 
     with filledcurves title '95% confidence', \
     "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2) 
     with lp lt 1 pt 7 ps 1.0 lw 3 title 'mean value'

plot "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2):($2*0.95):($2*1.05) \
     with yerrorlines lt 1 pt 7 ps 1.0 lw 3 title 'value+confidence'

The set table part creates a new table with unique values and their average:

# Curve 0 of 1, 3 points
# Curve title: ""tmp" u 9:13"
# x y type
0.1  32.3633  i
0.2  64.5033  i
0.1  32.0938  u

Then you can treat this file as a regular data file, and plot as you wish. The only trick comes here: (strcol(3) eq "u" ? 1/0: $2). This is a simple conditional telling gnuplot to ignore lines where column 3 is "u" (undefined). Then you can directly calculate your confidence interval as you wish.

Running these lines would give you the following output, with two suggested layouts:

A bit of help from here and here

Anyway, hope it helps!

Thanks. You are right, I have added the `Gnuplot` to the question. I like very much Gnuplot only solution, but I do not know if it is possible to add a computation of the confidence interval using the formula `mean +- sqrt{ \sum{x_i - mean}^2/(count-1)} / sqrt{count}` instead of mean*1.05 and mean*0.95. — user1993416, Feb 01 '18 at 19:34
Yes, it is actually possible. Gnuplot has a built-in `stats` function. Using your input file, try `stats "datafile" u ($9==0.1?$13:1/0)`. This gives you basic numbers for column 13 when column 9==0.1. Then, gnuplot stores all of these in variables, and you can, for example, do `print STATS_records` (count), `STATS_sum` (sum), and `STATS_mean` (mean). It would take a bit extra coding to include on the plot, but I suspect not a lot. — Vinicius Placco, Feb 01 '18 at 19:44
Thank you. I will try Gnuplot `stats`. May I ask you what it does the 1/0 in `($9==0.1?$13:1/0)` ?. — user1993416, Feb 01 '18 at 20:22
From the `gnuplot` manual: The integer expression "1/0" may be used to generate an "undefined" flag, which causes a point to ignored. So `($9==0.1?$13:1/0)` says: if field $9 is 0.1, then print (plot) field $13; otherwise ignore this point — Vinicius Placco, Feb 01 '18 at 20:40

user1993416 · Answer 3 · 2018-02-01T16:35:21.517

I would like to complete the script above of @shaikisiegal adding the formula to compute the confidence interval using the t-student pdf.

awk '
BEGIN {sum=0;pat=0;cnt=0;summ=0}{
   if(pat==$9) {sum=sum+$13; arr[cnt]=$13; cnt++} else {
     if (cnt!=0) {
         for (i=1;i<=cnt;i++) {
             summ+=(arr[i]-(sum/cnt))^2
         }
          printf "%s %2.2f %2.2f %2.2f\n" ,pat, sum/cnt, (sum/cnt)-2.045*sqrt(summ/(cnt-1))/sqrt(cnt), (sum/cnt)+2.045*sqrt(summ/(cnt-1))/sqrt(cnt)
      } 
    pat=$9;sum=0;cnt=0;summ=0
    } 
}
END {
  for (i=1;i<=cnt;i++) {
             summ+=(arr[i]-(sum/cnt))^2
         }
  printf "%s %2.2f %2.2f %2.2f\n" ,pat, sum/cnt, (sum/cnt)-2.045*sqrt(summ/(cnt-1))/sqrt(cnt), (sum/cnt)+2.045*sqrt(summ/(cnt-1))/sqrt(cnt)
  }
  ' data_table.txt

where 2.045 is the tabular value for 95% and 30 data samples of the distribution table of the t-student pdf.

Mean and Confidence Interval to plot

3 Answers3

Answer

Tear down

Output (for file entered above)