1

I have a data file with tab-separated columns which looks like this:

6   27  4   12  20  100 50000   false   0.1 "DFSA"  2   201414  31.47408
3   27  4   12  20  100 50000   false   0.1 "DFSA"  2   204236  31.91436
4   27  4   12  20  100 50000   false   0.1 "DFSA"  2   206964  32.09382
8   27  4   12  20  100 50000   false   0.1 "DFSA"  2   203379  31.88138
1   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207287  32.4096
2   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207115  32.61346
7   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207751  32.77364
5   27  4   12  20  100 50000   false   0.1 "DFSA"  2   209856  32.77856
13  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204257  32.08478
10  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207335  32.22742
9   27  4   12  20  100 50000   false   0.1 "DFSA"  2   200518  31.64624
11  27  4   12  20  100 50000   false   0.1 "DFSA"  2   202415  32.0591
16  27  4   12  20  100 50000   false   0.1 "DFSA"  2   201548  31.59604
12  27  4   12  20  100 50000   false   0.1 "DFSA"  2   213840  33.27478
15  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208926  32.37782
14  27  4   12  20  100 50000   false   0.1 "DFSA"  2   210572  32.81294
17  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204724  32.26324
20  27  4   12  20  100 50000   false   0.1 "DFSA"  2   199169  31.53494
18  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208179  32.73408
19  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204342  31.82608
21  27  4   12  20  100 50000   false   0.1 "DFSA"  2   205068  31.986
24  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207798  32.49448
22  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207499  32.60746
23  27  4   12  20  100 50000   false   0.1 "DFSA"  2   214065  33.17844
25  27  4   12  20  100 50000   false   0.1 "DFSA"  2   198386  31.4488
26  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208307  32.77412
28  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208669  32.77474
27  27  4   12  20  100 50000   false   0.1 "DFSA"  2   214324  33.16506
30  27  4   12  20  100 50000   false   0.1 "DFSA"  2   209924  32.77186
29  27  4   12  20  100 50000   false   0.1 "DFSA"  2   212959  33.3218
32  27  4   12  20  100 50000   false   0.2 "DFSA"  2   486241  64.3479
31  27  4   12  20  100 50000   false   0.2 "DFSA"  2   487487  64.65076
33  27  4   12  20  100 50000   false   0.2 "DFSA"  2   494703  65.06718
34  27  4   12  20  100 50000   false   0.2 "DFSA"  2   488164  64.77828
36  27  4   12  20  100 50000   false   0.2 "DFSA"  2   476513  63.3158
35  27  4   12  20  100 50000   false   0.2 "DFSA"  2   491005  65.02426
38  27  4   12  20  100 50000   false   0.2 "DFSA"  2   487454  64.44962
37  27  4   12  20  100 50000   false   0.2 "DFSA"  2   490494  65.06572
39  27  4   12  20  100 50000   false   0.2 "DFSA"  2   472081  63.31234
40  27  4   12  20  100 50000   false   0.2 "DFSA"  2   498294  65.02114

The data continues down growing the value in column 9 (30 rows with value 0.1, 30 rows with value 0.2, another 30 rows with 0.3 and so on). I would like to plot with yerrorlines in Gnuplot the mean value column 13 for each of the 30 rows with the same value of column 9. To make the plot, I should pass Gnuplot column 9 value, mean of column 13 of the 30 rows, and the min value and the max value of the confidence interval (e.g. 95%). In the case of 0.1, the row to pass to Gnuplot should be like the following:

0.1  36.08  35.83  36.33
0.2  .....  .....  .....
0.3  .....  .....  .....
...  .....  .....  .....

Assuming a 95% confidence interval is +0.25 and +0.25 around the mean of 36.08

Thank you for your help Regards

user1993416
  • 698
  • 1
  • 9
  • 28
  • The goal is that you add some code of your own to show at least the research effort you made to solve this yourself. – Cyrus Feb 01 '18 at 05:40
  • sorry but it is not clear how you compute your average... where do the `36.08` `35.83` `36.33` come from? – Allan Feb 01 '18 at 05:58
  • @Allan thanks to comment. The value 36.08 is the mean of last column values where column 8 has de value 0.1. The 35.83 is the min value of the confidence interval and 36.33 de max value, if the confidence interval is 0.5. – user1993416 Feb 01 '18 at 08:43

3 Answers3

3

Answer

awk 'BEGIN{sum=0;pat=0;cnt=0}{if(pat==$9) {sum=sum+$13;cnt++} else {if (cnt!=0) {printf "%s %2.2f %2.2f %2.2f\n" ,pat,(sum/cnt)*0.95,sum/cnt,sum/cnt*1.05} pat=$9;sum=0;cnt=0} }END{printf "%s %2.2f %2.2f %2.2f\n" ,pat,(sum/cnt)*0.95,sum/cnt,sum/cnt*1.05}' data_table.txt

ARGS:

1.pat - number in column 9

2.sum - contains sums of column 13 of all lines that pat is the same

3.cnt - contains the times that pat was red

Tear down

  1. BEGIN{sum=0;pat=0;cnt=0} - set all variables
  2. if(pat==$9) - check if pattern is the same if it the same form mean and increment ctn
  3. else - if cnt!=0 (pat line was summed for mean) print the wanted value to screen
  4. printf "%s %2.2f %2.2f %2.2f\n" ,pat,(sum/cnt)*0.95,sum/cnt,sum/cnt*1.05 %2.2f get floating point 00.00 structure
  5. END print mean for last pat

Output (for file entered above)

 0.1 30.77 32.39 34.01
 0.2 61.29 64.52 67.75
Community
  • 1
  • 1
shaiki siegal
  • 392
  • 3
  • 10
  • 1
    thank you very much. The awk code is perfect. I only will need to change the formula for the min and max value of the confidence interval to compute using the t-student pdf. – user1993416 Feb 01 '18 at 08:47
3

I know the question has the tag awk, but since gnuplot was mentioned, may I suggest a gnuplot-only solution?

Gnuplot has a smooth unique function that would look for unique values (in your case column 9) and do a simple average of the y-values (in your case column 13). You can plot this directly, but with a few extra lines you can make your output much better.

Using the input file from the question, and the following code:

set term png
set out "test.png"

set table "tab.tmp"
plot "data.txt" u 9:13 smooth unique
unset table

set xrange[0.09:0.21]
set yrange[25:75]

set style fill transparent solid 0.2 noborder

set key top left
set multiplot layout 2,1

plot "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2*0.95):($2*1.05) 
     with filledcurves title '95% confidence', \
     "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2) 
     with lp lt 1 pt 7 ps 1.0 lw 3 title 'mean value'

plot "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2):($2*0.95):($2*1.05) \
     with yerrorlines lt 1 pt 7 ps 1.0 lw 3 title 'value+confidence'

The set table part creates a new table with unique values and their average:

# Curve 0 of 1, 3 points
# Curve title: ""tmp" u 9:13"
# x y type
0.1  32.3633  i
0.2  64.5033  i
0.1  32.0938  u

Then you can treat this file as a regular data file, and plot as you wish. The only trick comes here: (strcol(3) eq "u" ? 1/0: $2). This is a simple conditional telling gnuplot to ignore lines where column 3 is "u" (undefined). Then you can directly calculate your confidence interval as you wish.

Running these lines would give you the following output, with two suggested layouts: enter image description here

A bit of help from here and here

Anyway, hope it helps!

Vinicius Placco
  • 1,683
  • 2
  • 14
  • 24
  • Thanks. You are right, I have added the `Gnuplot` to the question. I like very much Gnuplot only solution, but I do not know if it is possible to add a computation of the confidence interval using the formula `mean +- sqrt{ \sum{x_i - mean}^2/(count-1)} / sqrt{count}` instead of mean*1.05 and mean*0.95. – user1993416 Feb 01 '18 at 19:34
  • Yes, it is actually possible. Gnuplot has a built-in `stats` function. Using your input file, try `stats "datafile" u ($9==0.1?$13:1/0)`. This gives you basic numbers for column 13 when column 9==0.1. Then, gnuplot stores all of these in variables, and you can, for example, do `print STATS_records` (count), `STATS_sum` (sum), and `STATS_mean` (mean). It would take a bit extra coding to include on the plot, but I suspect not a lot. – Vinicius Placco Feb 01 '18 at 19:44
  • Thank you. I will try Gnuplot `stats`. May I ask you what it does the 1/0 in `($9==0.1?$13:1/0)` ?. – user1993416 Feb 01 '18 at 20:22
  • From the `gnuplot` manual: The integer expression "1/0" may be used to generate an "undefined" flag, which causes a point to ignored. So `($9==0.1?$13:1/0)` says: if field $9 is 0.1, then print (plot) field $13; otherwise ignore this point – Vinicius Placco Feb 01 '18 at 20:40
2

I would like to complete the script above of @shaikisiegal adding the formula to compute the confidence interval using the t-student pdf.

awk '
BEGIN {sum=0;pat=0;cnt=0;summ=0}{
   if(pat==$9) {sum=sum+$13; arr[cnt]=$13; cnt++} else {
     if (cnt!=0) {
         for (i=1;i<=cnt;i++) {
             summ+=(arr[i]-(sum/cnt))^2
         }
          printf "%s %2.2f %2.2f %2.2f\n" ,pat, sum/cnt, (sum/cnt)-2.045*sqrt(summ/(cnt-1))/sqrt(cnt), (sum/cnt)+2.045*sqrt(summ/(cnt-1))/sqrt(cnt)
      } 
    pat=$9;sum=0;cnt=0;summ=0
    } 
}
END {
  for (i=1;i<=cnt;i++) {
             summ+=(arr[i]-(sum/cnt))^2
         }
  printf "%s %2.2f %2.2f %2.2f\n" ,pat, sum/cnt, (sum/cnt)-2.045*sqrt(summ/(cnt-1))/sqrt(cnt), (sum/cnt)+2.045*sqrt(summ/(cnt-1))/sqrt(cnt)
  }
  ' data_table.txt

where 2.045 is the tabular value for 95% and 30 data samples of the distribution table of the t-student pdf.

user1993416
  • 698
  • 1
  • 9
  • 28