3

Bash Gurus, I need to compute the max and percentile numbers for each item in the list, using awk

aa  1
ab  3
aa  4
ac  5
aa  3
ad  2
ab  4
ac  2
ae  2
ac  5

Expected output

Item   90th percentile   max value
aa     3.8             4
ab     3.9             4
ac     5               5
ad     2               2
ae     2               2

Am able to get the sum and max using the below, but not the percentile.

awk '{
item[$1]++;
count[$1]+=$2;
max[$1]=$2;
percentile[$1,.9]=$2
 }
 END{
 for (var in item)
 print var,count[var],max[var],percentile[var]
 }
' 

Please suggest.

Pradeep BS
  • 51
  • 1
  • 7
  • 1
    What are you expecting `percentile[$1,.9]=$2` to do? – Tom Fenech Nov 09 '16 at 10:23
  • 3
    What method do you use to calculate the percentile ? Linear Interpolation ? Nearest Rank ? Have you implemented a function to do so in bash ? – Aserre Nov 09 '16 at 10:25
  • @jas they are 1, 3, and 4. – Jacek Trociński Nov 09 '16 at 11:10
  • Ups, I was blind :-(, thanks, @dood! – jas Nov 09 '16 at 11:13
  • @TomFenech - hoping to get to 90th percentile. – Pradeep BS Nov 09 '16 at 11:51
  • @EdMorton Aren't you aware of percentiles? – Pradeep BS Nov 10 '16 at 04:11
  • @PradeepBS yes I am. Are you looking for help to figure out how to calculate percentiles or how to implement that calculation in awk? If the former then you are doing the right thing by just stating you want percentiles and leaving it up to others to do everything else from there, but if the latter then showing your algorithm would go a long way to encouraging awk experts to help you implement it in awk otherwise the number of people who'll want to help you will be limited and you may end up with something that produces the output you want but is far from a good solution. – Ed Morton Nov 10 '16 at 15:39

3 Answers3

2

Percentile calculation from Statistics for Dummies 2nd ed. :). In Gnu awk:

$ cat mnp.awk
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc"   # for order in output
    if(p=="")                              # if p not defined it's median
        p=0.5
    else
        p=p/100                            # if 90th percentile: p=0.9
}
{
    v[$1][NR]=$2                           # values stored per keyword. NR for unique
    if($2>m[$1])                           # find max val
        m[$1]=$2
}
END {
    for(i in v) {                          # for all keywords
        n=asort(v[i])                      # sort values, n is count
        prc=p*n;                           # percentile figuration
        if(prc==int(prc))
            w=(v[i][prc]+v[i][prc+1])/2
        else
            w=v[i][int(prc)+1]
        print i, m[i], w                   # print keyword, max and nth value
    }
}

Run it:

$ awk -p=90 -f mnp.awk data.txt
aa 4 4
ab 4 4
ac 5 5
ad 2 2
ae 2 2

TODO: if the data file was sorted, this could be streamlined and not all data would need to be stored to memory.

James Brown
  • 36,089
  • 7
  • 43
  • 59
  • This is not the expected output OP posted. – Jacek Trociński Nov 10 '16 at 18:33
  • 1
    @dood Yeah. I wish OP would've given the definition of percentile he would've wanted. Quoting Wikipedia's page on percentile: _There is no standard definition of percentile, however all definitions yield similar results when the number of observations is very large_. The definition I used came from _Statistics for Dummies_ 2nd ed. – James Brown Nov 10 '16 at 22:16
1

datamash is a lovely tool, although it doesn't support the percantile part.

$ datamash -W --sort --group=1 max 2 min 2 < INPUT
aa  4   1
ab  4   3
ac  5   2
ad  2   2
ae  2   2

It supports the following operations

File operations:
  transpose, reverse
Numeric Grouping operations:
  sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
  count, first, last, rand 
  unique, collapse, countunique
Statistical Grouping operations:
  mean, median, q1, q3, iqr, mode, antimode
  pstdev, sstdev, pvar, svar, mad, madraw
  pskew, sskew, pkurt, skurt, dpo, jarque
Fredrik Pihl
  • 44,604
  • 7
  • 83
  • 130
0

Here is an elegant solution I found floating around the internet for finding the max value:

{
  max[$1] = !($1 in max) ? $2 : ($2 > max[$1]) ? $2 : max[$1]
}
END {
  for (i in max)
    print i, max[i]
}

Output:

ab 4
ac 5
ad 2
ae 2
aa 4
Jacek Trociński
  • 882
  • 1
  • 8
  • 23
  • 1
    You'd just do `max[$1] = ( ($1 in max) && (max[$1] > $2) ? max[$1] : $2 )` to avoid negative (`!`) and repeating (setting it to `$2` in multiple locations) syntax. – Ed Morton Nov 10 '16 at 17:25