1

I am working in Bash and trying to write an Awk script that takes data from a CSV file, groups the data by rows and then get the min, max, and avg of the values.

Here is the complete CSV file:

Student,Catehory,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Chelsey,Homework,H02,89,100
Chelsey,Homework,H03,77,100
Chelsey,Homework,H04,80,100
Chelsey,Homework,H05,82,100
Chelsey,Homework,H06,84,100
Chelsey,Homework,H07,86,100
Chelsey,Lab,L01,91,100
Chelsey,Lab,L02,100,100
Chelsey,Lab,L03,100,100
Chelsey,Lab,L04,100,100
Chelsey,Lab,L05,96,100
Chelsey,Lab,L06,80,100
Chelsey,Lab,L07,81,100
Chelsey,Quiz,Q01,100,100
Chelsey,Quiz,Q02,100,100
Chelsey,Quiz,Q03,98,100
Chelsey,Quiz,Q04,93,100
Chelsey,Quiz,Q05,99,100
Chelsey,Quiz,Q06,88,100
Chelsey,Quiz,Q07,100,100
Chelsey,Final,FINAL,82,100
Chelsey,Survey,WS,5,5
Sam,Homework,H01,19,100
Sam,Homework,H02,82,100
Sam,Homework,H03,95,100
Sam,Homework,H04,46,100
Sam,Homework,H05,82,100
Sam,Homework,H06,97,100
Sam,Homework,H07,52,100
Sam,Lab,L01,41,100
Sam,Lab,L02,85,100
Sam,Lab,L03,99,100
Sam,Lab,L04,99,100
Sam,Lab,L05,0,100
Sam,Lab,L06,0,100
Sam,Lab,L07,0,100
Sam,Quiz,Q01,91,100
Sam,Quiz,Q02,85,100
Sam,Quiz,Q03,33,100
Sam,Quiz,Q04,64,100
Sam,Quiz,Q05,54,100
Sam,Quiz,Q06,95,100
Sam,Quiz,Q07,68,100
Sam,Final,FINAL,58,100
Sam,Survey,WS,5,5
Andrew,Homework,H01,25,100
Andrew,Homework,H02,47,100
Andrew,Homework,H03,85,100
Andrew,Homework,H04,65,100
Andrew,Homework,H05,54,100
Andrew,Homework,H06,58,100
Andrew,Homework,H07,52,100
Andrew,Lab,L01,87,100
Andrew,Lab,L02,45,100
Andrew,Lab,L03,92,100
Andrew,Lab,L04,48,100
Andrew,Lab,L05,42,100
Andrew,Lab,L06,99,100
Andrew,Lab,L07,86,100
Andrew,Quiz,Q01,25,100
Andrew,Quiz,Q02,84,100
Andrew,Quiz,Q03,59,100
Andrew,Quiz,Q04,93,100
Andrew,Quiz,Q05,85,100
Andrew,Quiz,Q06,94,100
Andrew,Quiz,Q07,58,100
Andrew,Final,FINAL,99,100
Andrew,Survey,WS,5,5
Ava,Homework,H01,55,100
Ava,Homework,H02,95,100
Ava,Homework,H03,84,100
Ava,Homework,H04,74,100
Ava,Homework,H05,95,100
Ava,Homework,H06,84,100
Ava,Homework,H07,55,100
Ava,Lab,L01,66,100
Ava,Lab,L02,77,100
Ava,Lab,L03,88,100
Ava,Lab,L04,99,100
Ava,Lab,L05,55,100
Ava,Lab,L06,66,100
Ava,Lab,L07,77,100
Ava,Quiz,Q01,88,100
Ava,Quiz,Q02,99,100
Ava,Quiz,Q03,44,100
Ava,Quiz,Q04,55,100
Ava,Quiz,Q05,66,100
Ava,Quiz,Q06,77,100
Ava,Quiz,Q07,88,100
Ava,Final,FINAL,99,100
Ava,Survey,WS,5,5
Shane,Homework,H01,50,100
Shane,Homework,H02,60,100
Shane,Homework,H03,70,100
Shane,Homework,H04,60,100
Shane,Homework,H05,70,100
Shane,Homework,H06,80,100
Shane,Homework,H07,90,100
Shane,Lab,L01,90,100
Shane,Lab,L02,0,100
Shane,Lab,L03,100,100
Shane,Lab,L04,50,100
Shane,Lab,L05,40,100
Shane,Lab,L06,60,100
Shane,Lab,L07,80,100
Shane,Quiz,Q01,70,100
Shane,Quiz,Q02,90,100
Shane,Quiz,Q03,100,100
Shane,Quiz,Q04,100,100
Shane,Quiz,Q05,80,100
Shane,Quiz,Q06,80,100
Shane,Quiz,Q07,80,100
Shane,Final,FINAL,90,100
Shane,Survey,WS,5,5

Basically, I have 5 student names and each student has completed a quiz, lab, homework for each lesson name, plus a survey and a final exam...

What I am trying to do is group this by Assignment name and generate a report that shows the lowest score achieved for that assignment, the highest score and the average score...

The output should be:

Name     Low     High  Avg
H02      66       99   74.22
L07      47       88   66.30

and include every individual assignment name from column 3 ($3). formatted using tab (/t)

The code I have pasted already outputs the headings and the 2 decimal places in avg column but the actual values are not correct.

I have only two issues really:

  1. I cannot for the life of me get the min or max for the individual groupings. I know how to get the min/max and even the basic syntax for it, but how do I get it to the individual groups?

  2. Scripting this. I have very limited experience using bash, or anything Linux for that matter and am unfamiliar with awk (though I am learning quite a bit now).

So, to get myself at least started i wrote a one liner to achieve the grouping and the output formatting I am looking for, but it is only summing the scores for each group and the average is all messed up because I still have not figured out how to get the count of the scores to use as a divisor.

Anyways, this is what I have:

awk -F "," 'BEGIN{printf "Name\tLow\tHigh\tAvg\n"}
            NR>=2{a[$3]+=$4; b[$3]+=$4;c[$3]+=$4/FNR }
            END {for (i in b) printf "%-7s\t%d\t%d\t%.02f\n", i,a[i],b[i],c[i]}'  \
    score-data.csv

The output is perfect in that it is grouping by the assignment names, 2 decimals in the avg column and tabbed.... but the low and high are not correct and the average, as you can see is messed up.. tried dividing the sum by FNR. Have also tried NF and NR both... no luck. Again, I know how to get a count, but no clue how to get it in here.

So, if anyone can help me get the min/max/avg taken care of and also with the syntax for this to be a script, it would be appreciated

I cannot comment for some reason, but I have searched google and read the man awk stuff and have two different tabs open in my browser to docs on awk. None of them address it for my situation.

As far as the array naming goes, it is all the same array being used; an associative array that uses column 3 as the index/key and values from column 4 as the key's values. all of the suggested searches and links involve columns; I need rows.

Community
  • 1
  • 1
Gene
  • 25
  • 6
  • 1
    Thank you for showing your code efforts in your question. Could you please simply add sample of input and sample of expected output in your question and let us know then. – RavinderSingh13 Jan 31 '20 at 03:22
  • I would recommend using more descriptive names for the arrays; it's bad enough having to jump back and forth from here to a spreadsheet at another website, then on top of that having to figure out what `a`, `b` and `c` refer to (and expect some confusion down the road ... weeks? months? ... when you come back to this piece of code and can't remember what the letters stand for eh) – markp-fuso Jan 31 '20 at 03:28
  • 1
    have a look at [awk: find min and max](https://stackoverflow.com/questions/29783990/awk-find-minimum-and-maximum-in-column) and [use awk to find average](https://stackoverflow.com/questions/19149731/use-awk-to-find-average-of-a-column) for some ideas; a google search on the terms `awk`, `min`, `max` and `avg` brings up quite a few hits – markp-fuso Jan 31 '20 at 03:35
  • 3
    Why is a different Gene account attempting to edit this with information which only the original poster could possess? – tripleee Jan 31 '20 at 05:09
  • Your desired output is wrong. For `H02`, the low is `47` (Andrew) and the high is `95` (Ava) and average is `74.60`. For `L07` the low is `0` (Sam) and the high is `86` (Andrew) with average `64.80`. – David C. Rankin Jan 31 '20 at 06:30

3 Answers3

2

Your problem is that your Awk script is not examining the results per key.

Try this instead.

awk -F , 'NR>1 { if(!($3 in course)) { low[$3] = high[$3] = $4 }
        if ($4 < low[$3]) low[$3] = $4;
        if ($4 > high[$3]) high[$3] = $4;
        sum[$3] += $4;
        ++course[$3] }
    END { OFS="\t"; print "Name", "Low", "High", "Avg";
        for (k in course)
          print k, low[k], high[k], sum[k]/course[k] }' file.csv

Result for your sample data:

Name    Low High    Avg
FINAL   58  99  85.6
L01 41  91  75
L02 0   100 61.4
L03 88  100 95.8
L04 48  100 79.2
L05 0   96  46.6
Q01 25  100 74.8
L06 0   99  61
Q02 84  100 91.6
L07 0   86  64.8
H01 19  90  47.8
WS  5   5   5
Q03 33  100 66.8
H02 47  95  74.6
Q04 55  100 81
H03 70  95  82.2
Q05 54  99  76.8
H04 46  80  65
Q06 77  95  86.8
H05 54  95  76.6
Q07 58  100 78.8
H06 58  97  80.6
H07 52  90  67

Calculating an average by dividing by line number only works when you want the average for the whole file (and even then of course if you are skipping some lines at the start, those should be subtracted from the divisor too).

tripleee
  • 175,061
  • 34
  • 275
  • 318
2

If you want to keep the output in order, you can do something similar to:

awk -F, '
BEGIN { printf "Name\tLow\tHigh\tAvg\n" }
NR > 1 {
    if ($3 in low) {            # if assignment already initialized
        if ($4 < low[$3])       # check new low score
            low[$3] = $4
        if ($4 > hi[$3])        # check new high score
            hi[$3] = $4
        sum[$3] += $4           # add to assignment sum
        grades[$3]++            # add to assignment score count
    }
    else {                      # new assignment name
        name[n++] = $3          # keep indexed array of names (for order)
        low[$3] = $4            # initialize low for assignment
        hi[$3]  = $4            # initialize high for assignment
        sum[$3] = $4            # initialize sum for assignment
        grades[$3] = 1          # initialize score count for assignment
    }
}
END {
    for (i=0; i<n; i++)         # output informaton in order
        printf "%s\t%d\t%d\t%.2f\n", name[i], low[name[i]], hi[name[i]], sum[name[i]]/grades[name[i]]
}' score-data.csv

The indexed array names above is used to preserve the assignment names in the order seen and then to iterate over the assignments for output in order:

Example Use/Output

Name    Low     High    Avg
H01     19      90      47.80
H02     47      95      74.60
H03     70      95      82.20
H04     46      80      65.00
H05     54      95      76.60
H06     58      97      80.60
H07     52      90      67.00
L01     41      91      75.00
L02     0       100     61.40
L03     88      100     95.80
L04     48      100     79.20
L05     0       96      46.60
L06     0       99      61.00
L07     0       86      64.80
Q01     25      100     74.80
Q02     84      100     91.60
Q03     33      100     66.80
Q04     55      100     81.00
Q05     54      99      76.80
Q06     77      95      86.80
Q07     58      100     78.80
FINAL   58      99      85.60
WS      5       5       5.00
David C. Rankin
  • 81,885
  • 6
  • 58
  • 85
2

It's not awk, but GNU Datamash is a handy tool designed just for this sort of calculation:

$ datamash -t, --header-in -g3 -s min 4 max 4 mean 4 < grades.csv \
  | awk 'BEGIN { FS=","; OFS="\t"; print "Name\tLow\tHigh\tAvg" } { $1=$1 } 1'
Name    Low     High    Avg
FINAL   58      99      85.6
H01     19      90      47.8
H02     47      95      74.6
H03     70      95      82.2
H04     46      80      65
H05     54      95      76.6
H06     58      97      80.6
H07     52      90      67
L01     41      91      75
L02     0       100     61.4
L03     88      100     95.8
L04     48      100     79.2
L05     0       96      46.6
L06     0       99      61
L07     0       86      64.8
Q01     25      100     74.8
Q02     84      100     91.6
Q03     33      100     66.8
Q04     55      100     81
Q05     54      99      76.8
Q06     77      95      86.8
Q07     58      100     78.8
WS      5       5       5

Okay, so there's an awk bit to print the desired header and convert from CSV to TSV.

This invocation says that comma is the field delimiter (-t,), that the input file has a header line, that it should be grouped and sorted on the third column (-g3 -s; datamash requires that the groups be sorted), and for each group, the minimum, maximum, and mean values of the fourth column should be calculated.

Shawn
  • 47,241
  • 3
  • 26
  • 60
  • I've been using datamash in a lot of answers lately... I swear I'm not affiliated with it; it's just so well suited for this sort of thing it's hard to resist. – Shawn Jan 31 '20 at 06:39
  • 1
    Yep, we have noticed, and the answers have been pretty damn slick. I haven't played with it much, but after your examples, my interest is peaked and will have to experiment with it further. (glad to see the answers match...) – David C. Rankin Jan 31 '20 at 06:54
  • My interest in Datamash is piqued too. – Jonathan Leffler Feb 01 '20 at 19:19