AWK - Using arrays to count by hour and unique value

Question

I have the following input file:

Unit1 15 00:20:58
Unit1 30 01:10:00
Unit3 10 00:20:15
Unit2 5  00:45:00
Unit3 20 00:30:00
Unit2 2  01:22:35
Unit2 3  01:35:22
Unit1 5  00:58:20

Some background on this input file: it is a list of work units for an e-portal that I have been tasked with analyzing. Each log line gives the unit name ($1), the total number of questions a student completed before hitting submit ($2), and the time the submission was recorded ($3). The data has been tweaked to allow for a clearer example.

I would like to output the following:

Unit1
---------------------
00
========
20
--------
01 
========
30
--------

Unit2
---------------------
00
========
5
--------
01 
========
5
--------

Unit3
---------------------
00
========
30
--------

The code I currently have is as follows:

#!/usr/bin/gawk -f

{ #Start of MID
        key = $1 #Message Extracted 10 Total
        key2 = substr($3,1,2) #Hour
        MSG_TYPE[key]++ #Distinct Message
        HOUR_AR[key2]++
        HT_AR[key2] += $2 #Tots up the total for each message by hour

} #End of MID
END {
                for (MSG in MSG_TYPE) {
                        print MSG
                        print "-----------------------------------"
                n=asorti(HOUR_AR, HOUR_SOR)
                for (i = 1; i <= n; i++) {
                            print HOUR_SOR[i]
                            print "========="
                            print HOUR_AR[HOUR_SOR[i]]
                            print "---------"
                            }
                            print "\n"
                    }
    } #End of END

The logic behind this code is that it gets all the unique values of $1 in the MSG_TYPE[] array. That array is then scanned in a for loop, which prints each value. The hours are collected in the HOUR_AR[] array, which is sorted; each pass of the MSG for loop should then, hopefully, return all the hours for that particular MSG and print the sum of $2 for that hour AND MSG.

I am sorry this is long-winded. I just wanted to provide enough detail. Any and all help is greatly appreciated.

Together with showing your code, you should also explain its logic. Where are all those 00, 20, 01 and 30 coming from in Unit1? — fedorqui, Jun 23 '16 at 08:21
Your edit still doesn't explain where all those numbers are coming from. — fedorqui, Jun 23 '16 at 08:37
Maybe I am a bit dumb, but to me this still doesn't make sense: we don't want to know what is the content of the files, but what logic or algorithm you apply to generate an output on the form of unit1 / 00 / 20 / 01 / 30. — fedorqui, Jun 23 '16 at 08:46
Don't use all upper case for variable names in awk or in shell (unless exported) to avoid clashing with builtin variables and obfuscating your code by making your code look like it's using builtin variables when it's not. — Ed Morton, Jun 23 '16 at 11:57

score 2 · Accepted Answer · answered Jun 23 '16 at 09:20

For the given example, this code gives the output you expected:

 awk -F'[ :]+' '{u[$1][$3]+=$2}
     END{for(i in u){
            print i;print "--------";
            for(j in u[i])
               print j"\n====\n"u[i][j]"\n---"}}' file

it outputs:

Unit1
--------
00
====
20
---
01
====
30
---
Unit2
--------
00
====
5
---
01
====
5
---
Unit3
--------
00
====
30
---

Note that the sorting part is not done in this code. But you get the idea: the implementation becomes easier if you use GNU awk's arrays of arrays.

https://www.gnu.org/software/gawk/manual/html_node/Arrays-of-Arrays.html#Arrays-of-Arrays
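For the missing sorting step, one option is gawk's `PROCINFO["sorted_in"]` setting, which controls the order in which `for (i in array)` visits the indices. The following is only a minimal sketch, assuming GNU awk 4.0 or later; the function name `unit_report_sorted` and the array name `total` are made up for illustration and are not part of the answer above.

```shell
# Sketch: the one-liner above with a sorted traversal. Requires GNU awk 4.0+.
# unit_report_sorted is a hypothetical helper name used only for illustration.
unit_report_sorted() {
    gawk -F'[ :]+' '
        # With space or colon as separator: $1 = unit, $2 = count, $3 = hour
        { total[$1][$3] += $2 }
        END {
            # gawk extension: visit indices in ascending string order
            PROCINFO["sorted_in"] = "@ind_str_asc"
            for (unit in total) {
                print unit
                print "--------"
                for (hour in total[unit]) {
                    print hour
                    print "===="
                    print total[unit][hour]
                    print "---"
                }
            }
        }
    ' "$@"
}
```

Called as `unit_report_sorted file`, this should list the units, and the hours within each unit, in ascending order instead of hash order.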

answered Jun 23 '16 at 09:20

Kent


Would you be able to explain what it is this code is doing just so I can further understand how you got to this answer? – glly Jun 23 '16 at 09:23
It uses a space or colon as the separator, so that the values we need are easier to get. It then uses arrays of arrays; I added a link, so if you don't understand how `u[i][j]` works, you can read the linked doc. The logic is simple: just two nested for loops and print. – Kent Jun 23 '16 at 09:25
when I try to implement `u[i][j]` I am getting a syntax error on the `[j]` part. Any idea why? – glly Jun 23 '16 at 09:52
No no idea... In my work I had an array problem on `a[i]` do you have any idea why? – Kent Jun 23 '16 at 09:55
Touché. the code I am trying to implement it in is as follows: `key = $1 key2 = $3 MSG_HR[key][key2] += $2 ` – glly Jun 23 '16 at 10:13
Never mind, I have found out that I am using an old version of `awk` that doesn't support the `a[i][j]` format. I'll have to use `a[i,j]`. – glly Jun 23 '16 at 10:29
@GlennLynam get a newer version of GNU awk (4.0 or more recent) as you are missing a TON of extremely useful functionality (see https://www.gnu.org/software/gawk/manual/gawk.html#Feature-History). – Ed Morton Jun 23 '16 at 12:00
@EdMorton I have requested to see if this would be possible, but it will require a CAB meeting for the business. I have had to rewrite the above by adapting the following: http://stackoverflow.com/questions/14280877/multidimensional-arrays-in-awk But I can confirm that the above works on my local machine, as I have installed the latest `awk` there. – glly Jun 23 '16 at 12:02
Whatever it takes - it's worth it. For example it'd give you the ability to specify the order that `for (i in array)` visits your array indices instead of them being "random" (hash order) as they will be with your current awk. – Ed Morton Jun 23 '16 at 12:10
hmmm that would be useful – glly Jun 23 '16 at 13:11
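
For an awk that lacks `a[i][j]`, the `a[i,j]` form mentioned in the comments can be sketched roughly as follows. This is an illustration under assumptions rather than the asker's actual rewrite: the function name `unit_report_portable` and the array names are invented here, and the ordering is delegated to an external `sort` because a plain awk visits array indices in an unspecified order.

```shell
# Sketch for awks without arrays of arrays, using a composite a[i,j] key.
# unit_report_portable is a hypothetical helper name used only for illustration.
unit_report_portable() {
    # Stage 1: total the counts per (unit, hour) and emit "unit hour total" lines.
    # Stage 2: sort by unit, then by hour.
    # Stage 3: print a header whenever the unit changes, then each hour block.
    awk -F'[ :]+' '
        # Composite key: unit and hour joined by SUBSEP
        { total[$1, $3] += $2 }
        END {
            for (key in total) {
                split(key, part, SUBSEP)   # part[1] = unit, part[2] = hour
                print part[1], part[2], total[key]
            }
        }
    ' "$@" |
    LC_ALL=C sort -k1,1 -k2,2 |
    awk '
        $1 != unit {
            unit = $1
            print unit
            print "--------"
        }
        { print $2; print "===="; print $3; print "---" }
    '
}
```

Called as `unit_report_portable file`, this should produce the same blocks as the accepted answer, grouped by unit and ordered by hour.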
