0

I have a big ascii-file that looks like this:

12,3,0.12,965.814
11,3,0.22,4313.2
14,3,0.42,7586.22
17,4,0,0
11,4,0,0
15,4,0,0
13,4,0,0
17,4,0,0
11,4,0,0
18,3,0.12,2764.86
12,3,0.22,2058.3
11,3,0.42,2929.62
10,4,0,0
10,4,0,0
14,4,0,0
12,4,0,0
19,3,0.12,1920.64
20,3,0.22,1721.51
12,3,0.42,1841.55
11,4,0,0
15,4,0,0
19,4,0,0
11,4,0,0
13,4,0,0
17,3,0.12,2738.99
12,3,0.22,1719.3
18,3,0.42,3757.72
.
.
.

I want to calculate a selected moving average over three values with awk. The selection should be done by the second and the third column. A moving average should be calculated only for lines with the second column is 3. The I would like to calculate three moving averages selected by the third column (contains for every "block" the same values in the same order). The moving average shall then be calculated of the fourth column. I would like to output the whole line of second moving-average value and replace the fourth column with the result. I know that sounds very complicated, so I will give an example what I want to calculate and also the desired result:

(965.814+2764.86+1920.64)/3 = 1883.77

and output the result together with line 10:

18,3,0.12,1883.77

Then continue with the second, eleventh and eighteenth line...

The end result for my data example shall look like this:

18,3,0.12,1883.77
12,3,0.22,2697.67
11,3,0.42,4119.13
19,3,0.12,2474.83
20,3,0.22,1833.04
12,3,0.42,2842.96

I tried to calculate the moving-average with the following code in awk but think I designed the script wrong because awk tells me syntax error for every "$2 == 3".

BEGIN { FS="," ; OFS = "," }
    $2 == 3 {
        a; b; c; d; e; f = 0        
        line1 = $0; a = $3; b = $4; getline
        line2 = $0; c = $3; d = $4; getline
        line3 = $0; e = $3; f = $4
            $2 == 3 {
                line11 = $0; a = $3; b += $4; getline
                line22 = $0; c = $3; d += $4; getline
                line33 = $0; e = $3; f += $4
                    $2 == 3 {
                        line111 = $0; a = $3; b += $4; getline
                        line222 = $0; c = $3; d += $4; getline
                        line333 = $0; e = $3; f += $4
                    }
            }

        $0 = line11; $3 = a; $4 = b/3; print
        $0 = line22; $3 = c; $4 = d/3; print
        $0 = line33; $3 = e; $4 = f/3
    }
    {print}

Can you help me understanding how to correct my script (I think I have shortcomings with the philosophy of awk) or to start a completly new script because there is an easier solution out there ;-)

I also tried another idea:

BEGIN { FS="," ; OFS = "," }
    i=0;
    do {
        i++;
        a; b; c; d; e; f = 0
        $2 == 3 {
        line1 = $0; a = $3; b += $4; getline
        line2 = $0; c = $3; d += $4; getline
        line3 = $0; e = $3; f += $4
    }while(i<3)

        $0 = line1; $3 = a; $4 = b/3; print
        $0 = line2; $3 = c; $4 = d/3; print
        $0 = line3; $3 = e; $4 = f/3
    }
    {print}

This one also does not work, awk gives me two syntax errors (one at the "do" and the other after the "$$2 == 3").

I changed and tried a lot in both scripts and at some point they ran without errors but they did not give the desired output at all, so I thought there has to be a general problem.

I hope you can help me, that would be really nice!

MacA
  • 25
  • 5
  • `a; b; c; d; e; f = 0 ` this is incorrect syntax, use `=` instead `a=b=c=d=e=f=0` –  Dec 07 '14 at 21:00
  • you can't nest blocks like (I think) you have done in your first sample code, ie. the 2nd time you want to test `$2 == 3` it has to be in an `if` block, i.e.` if ($2 == 3) { cmd; cmd2; cmd3; ... }`. As you're repeating the same code 3 levels, consider making it a function that you can call whenever you need it. Good luck. – shellter Dec 08 '14 at 03:28

1 Answers1

2

Normalize your input

If you normalize your input using the right tools, then the task of finding a solution is far easier

My idea is to use awk to select the records where $2==3 and then use sort to group the data on the numerical value of the third column

% echo '12,3,0.12,965.814
11,3,0.22,4313.2
14,3,0.42,7586.22
17,4,0,0
11,4,0,0
15,4,0,0
13,4,0,0
17,4,0,0
11,4,0,0
18,3,0.12,2764.86
12,3,0.22,2058.3
11,3,0.42,2929.62
10,4,0,0
10,4,0,0
14,4,0,0
12,4,0,0
19,3,0.12,1920.64
20,3,0.22,1721.51
12,3,0.42,1841.55
11,4,0,0
15,4,0,0
19,4,0,0
11,4,0,0
13,4,0,0
17,3,0.12,2738.99
12,3,0.22,1719.3
18,3,0.42,3757.72' | \
awk -F, '$2==3' | \
sort --field-separator=, --key=3,3 --numeric-sort --stable
12,3,0.12,965.814
18,3,0.12,2764.86
19,3,0.12,1920.64
17,3,0.12,2738.99
11,3,0.22,4313.2
12,3,0.22,2058.3
20,3,0.22,1721.51
12,3,0.22,1719.3
14,3,0.42,7586.22
11,3,0.42,2929.62
12,3,0.42,1841.55
18,3,0.42,3757.72
% 

Reason on normalized input

As you can see, the situation is now much clearer and we can try to design an algorithm to output a 3-elements running mean.

% awk -F, '$2==3' YOUR_FILE | \
sort --field-separator=, --key=3,3 --numeric-sort --stable | \
awk -F, '                                      
    $3!=prev {prev=$3
              c=0
              s[1]=0;s[2]=0;s[3]=0}
             {old=new
              new=$0
              c = c+1; i = (c-1)%3+1; s[i] = $4
              if(c>2)print old FS (s[1]+s[2]+s[3])/3}'
18,3,0.12,2764.86,1883.77
19,3,0.12,1920.64,2474.83
12,3,0.22,2058.3,2697.67
20,3,0.22,1721.51,1833.04
11,3,0.42,2929.62,4119.13
12,3,0.42,1841.55,2842.96

Oops,

I forgot your requirement on SUBSTITUTING $4 with the running mean, I will come out with a solution unless you're faster than me...

Edit: change the line

             {old=new

to

             {split(new,old,",")

and change the line

              if(c>2)print old FS (s[1]+s[2]+s[3])/3}'

to

              if(c>2) print old[1] FS old[2] FS old[3] FS (s[1]+s[2]+s[3])/3}'
gboffi
  • 22,939
  • 8
  • 54
  • 85
  • This looks very good! Thanks a lot! I wonder if there is any chance to get the result in the same order than the input? But this might be difficult since we have to "store" the order somehow... – MacA Dec 08 '14 at 07:38
  • I don't understand _the result in the same order than the input_, for your request the running means are computed for specific values of `$3` and so in my solution they are grouped by `$3`... – gboffi Dec 08 '14 at 08:55
  • You can do as follows, in the first pass prepend the record number to each record `awk -F, '$2==3{print NR FS $0}'`, change all the references to the fields in the following sections (remember to output also the original record number when you print a running mean) and post process the output to _1)_ sort on the original record number, _2)_ remove the original record number. – gboffi Dec 08 '14 at 09:03
  • Yes, I understood that. I will give an example, if we take the output in your post, then I would like to have the order line1, line3, line5, line2, line4, line6 – MacA Dec 08 '14 at 09:05