Compute percentile and max value per variable

Question

Bash Gurus, I need to compute the max and percentile numbers for each item in the list, using awk

aa  1
ab  3
aa  4
ac  5
aa  3
ad  2
ab  4
ac  2
ae  2
ac  5

Expected output

Item   90th percentile   max value
aa     3.8             4
ab     3.9             4
ac     5               5
ad     2               2
ae     2               2

Am able to get the sum and max using the below, but not the percentile.

awk '{
item[$1]++;
count[$1]+=$2;
max[$1]=$2;
percentile[$1,.9]=$2
 }
 END{
 for (var in item)
 print var,count[var],max[var],percentile[var]
 }
'

Please suggest.

What method do you use to calculate the percentile ? Linear Interpolation ? Nearest Rank ? Have you implemented a function to do so in bash ? — Aserre, Nov 09 '16 at 10:25
@PradeepBS yes I am. Are you looking for help to figure out how to calculate percentiles or how to implement that calculation in awk? If the former then you are doing the right thing by just stating you want percentiles and leaving it up to others to do everything else from there, but if the latter then showing your algorithm would go a long way to encouraging awk experts to help you implement it in awk otherwise the number of people who'll want to help you will be limited and you may end up with something that produces the output you want but is far from a good solution. — Ed Morton, Nov 10 '16 at 15:39

James Brown · Accepted Answer · 2016-11-09T13:27:48.863

Percentile calculation from Statistics for Dummies 2nd ed. :). In Gnu awk:

$ cat mnp.awk
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc"   # for order in output
    if(p=="")                              # if p not defined it's median
        p=0.5
    else
        p=p/100                            # if 90th percentile: p=0.9
}
{
    v[$1][NR]=$2                           # values stored per keyword. NR for unique
    if($2>m[$1])                           # find max val
        m[$1]=$2
}
END {
    for(i in v) {                          # for all keywords
        n=asort(v[i])                      # sort values, n is count
        prc=p*n;                           # percentile figuration
        if(prc==int(prc))
            w=(v[i][prc]+v[i][prc+1])/2
        else
            w=v[i][int(prc)+1]
        print i, m[i], w                   # print keyword, max and nth value
    }
}

Run it:

$ awk -p=90 -f mnp.awk data.txt
aa 4 4
ab 4 4
ac 5 5
ad 2 2
ae 2 2

TODO: if the data file was sorted, this could be streamlined and not all data would need to be stored to memory.

@dood Yeah. I wish OP would've given the definition of percentile he would've wanted. Quoting Wikipedia's page on percentile: _There is no standard definition of percentile, however all definitions yield similar results when the number of observations is very large_. The definition I used came from _Statistics for Dummies_ 2nd ed. — James Brown, Nov 10 '16 at 22:16

score 1 · Answer 2 · answered Nov 09 '16 at 13:10

datamash is a lovely tool, although it doesn't support the percantile part.

$ datamash -W --sort --group=1 max 2 min 2 < INPUT
aa  4   1
ab  4   3
ac  5   2
ad  2   2
ae  2   2

It supports the following operations

File operations:
  transpose, reverse
Numeric Grouping operations:
  sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
  count, first, last, rand 
  unique, collapse, countunique
Statistical Grouping operations:
  mean, median, q1, q3, iqr, mode, antimode
  pstdev, sstdev, pvar, svar, mad, madraw
  pskew, sskew, pkurt, skurt, dpo, jarque

score 0 · Answer 3 · answered Nov 09 '16 at 12:55

0

Here is an elegant solution I found floating around the internet for finding the max value:

{
  max[$1] = !($1 in max) ? $2 : ($2 > max[$1]) ? $2 : max[$1]
}
END {
  for (i in max)
    print i, max[i]
}

Output:

ab 4
ac 5
ad 2
ae 2
aa 4

answered Nov 09 '16 at 12:55

Jacek Trociński

882
1
8
23

1

You'd just do `max[$1] = ( ($1 in max) && (max[$1] > $2) ? max[$1] : $2 )` to avoid negative (`!`) and repeating (setting it to `$2` in multiple locations) syntax. – Ed Morton Nov 10 '16 at 17:25

Compute percentile and max value per variable

3 Answers3