2

I have a daily file output on a linux system like the below and was wondering is there a way to group the data in 30min increments based on $1 and avg $3 and sum $4 $5 $6 $7 $8 via a shell script using awk/gawk or something similar?

04:04:13 04:10:13 2.13 36 27 18 18 0
04:09:13 04:15:13 2.37 47 38 13 34 0
04:14:13 04:20:13 2.19 57 37 23 33 1
04:19:13 04:25:13 2.43 43 35 13 30 0
04:24:13 04:30:13 2.29 48 40 19 28 1
04:29:13 04:35:13 2.33 56 42 16 40 0
04:34:13 04:40:13 2.21 62 47 30 32 0
04:39:13 04:45:13 2.25 44 41 19 25 0
04:44:13 04:50:13 2.20 65 50 32 33 0
04:49:13 04:55:13 2.47 52 38 16 36 0
04:54:13 05:00:13 2.07 72 54 40 32 0
04:59:13 05:05:13 2.35 53 41 19 34 0

so basically this hour of data would result in something like this:

04:04:13-04:29:13 2.29 287 219 102 183 2 
04:34:13-04:59:13 2.25 348 271 156 192 0

this is what I have gotten so far using awk to search between the time frames but I think there is an easier way to get the grouping done without awking each 30min interval

awk '$1>=from&&$1<=to' from="04:00:00" to="04:30:00" | awk '{ total += $3; count++ } END { print total/count }'|awk '{printf "%0.2f\n", $1'}

awk '$1>=from&&$1<=to' from="04:00:00" to="04:30:00" | awk '{ sum+=$4} END {print sum}'
user1999357
  • 113
  • 1
  • 2
  • 11

2 Answers2

0

This should do what you want:

{
    split($1, times, ":");
    i = (2 * times[1]);
    if (times[2] >= 30) i++;
    if (!start[i] || $1 < start[i]) start[i] = $1;
    if (!end[i] || $1 > end[i]) end[i] = $1;
    count[i]++;
    for (col = 3; col <= 8; col++) {
        data[i, col] += $col;
    }
}

END {
    for (i = 1; i <= 48; i++) {
        if (start[i]) {
            data[i, 3] = data[i, 3] / count[i];
            printf("%s-%s %.2f", start[i], end[i], data[i, 3]);
            for (col = 4; col <= 8; col++) {
                printf(" " data[i, col]);
            }
            print "";
        }
    }
}

As you can see, I divide the day into 48 half-hour intervals and place the data into one of these bins depending on the time in the first column. After the input has been exhausted, I print out all bins that are not empty.

Michael Vehrs
  • 3,293
  • 11
  • 10
0

Personally, I would do this in Python or Perl. In awk, the arrays are not ordered (well, in gawk you could use assorti to sort the array...) which makes printing ordered buckets more work.

Here is the outline:

  1. Read input
  2. Convert the time stamp to seconds
  3. Add to an ordered (or sortable) associative array of the data elements in buckets of the desired time frame (or, just keep running totals).
  4. After the data is read, process as you wish.

Here is a Python version of that:

#!/usr/bin/python

from collections import OrderedDict
import fileinput
times=[]
interval=30*60
od=OrderedDict()

for line in fileinput.input():
    li=line.split()
    secs=sum(x*y for x,y in zip([3600,60,1], map(int, li[0].split(":"))))
    times.append([secs, [li[0], float(li[2])]+map(int, li[3:])])

current=times[0][0]
for t, li in times:
    if t-current<interval:
        od.setdefault(current, []).append(li)
    else:
        current=t
        od.setdefault(current, []).append(li)

for s, LoL in od.items():
    avg=sum(e[1] for e in LoL)/len(LoL)
    sums=[sum(e[i] for e in LoL) for i in range(2,7)]
    print "{}-{} {:.3} {}".format(LoL[0][0], LoL[-1][0], avg, ' '.join(map(str, sums))) 

Running that on your example data:

$ ./ts.py ts.txt
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.26 348 271 156 192 0    

The advantage is you can easily change the interval and a similar technic can use timestamps that are longer than days.


If you really want awk you could do:

awk 'BEGIN{ interval=30*60 }
   function fmt(){
          line=sprintf("%s-%s %.2f %i %i %i %i %i", ls, $1, sums[3]/count, 
                             sums[4], sums[5], sums[6], sums[7], sums[8])
   }
   {
       split($1,a,":") 
       secs=a[1]*3600+a[2]*60+a[3]
       if (NR==1) { 
            low=secs
            ls=$1
            count=0
            for (i=3; i<=8; i++)
               sums[i]=0
        }                       
        for (i=3; i<=8; i++){
           sums[i]+=$i
        }
        count++         
        if (secs-low<interval) {
            fmt()
            }       
        else {
            print line
            low=secs
            ls=$1
            count=1
            for (i=3; i<=8; i++)
               sums[i]=$i                           
            }                   
   }
   END{
        fmt()
        print line
   }' file
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.26 348 271 156 192 0
dawg
  • 98,345
  • 23
  • 131
  • 206