How to find the X highest values in a list depending on the values of another list with bash/awk?

Question

I found a bit of answers with this topic on this forum : How to find the X lowest values in a list with bash/awk?

But my problem is a little bit different.

This is an exemple of inpout.txt file (a "," delimited text file which can be open as csv file):

ID, Code, Value, Store
SP|01, AABBCDE, 15, 3 
SP|01, AABBCDE, 14, 2 
SP|01, AABBCDF, 13, 2
SP|01, AABBCDE, 16, 3 
SP|02, AABBCED, 15, 2 
SP|01, AABBCDF, 12, 3
SP|01, AABBCDD, 13, 2 
SP|02, AABBCDF, 9, 2
SP|01, AABBCDF, 8, 3

The idea is to print rows with the 2 highest values in "Value" column for identical values in "Code" (And also keep the headers)

Exemple of output files:

ID Code Value Store
SP|01, AABBCDE, 16, 3
SP|01, AABBCDE, 15, 3
SP|02, AABBCED, 15, 2
SP|01, AABBCDD, 13, 2
SP|01, AABBCDF, 13, 2
SP|01, AABBCDF, 12, 3

I'm new to Linux and have a bit of knowledge in very basic use of AWK, sed, grep ect but I'm unsure how to manipulate the file to get the output as stated above.

Any help would be very much appreciated!

Is this a homework assignment? I'm not asking to be a douche, it just seems like a very contrived problem. — Staven, Dec 09 '13 at 05:13
No I understand ^^, It is not a homework. but in my job i have to process data and filter them all day (i usually do that on excel however i begin to love and use more and more the Terminal on Linux and try to develop skill on it to make my life easier) I have already made some basic script for data processing however i never found how to make what i want this time with awk — user3081482, Dec 09 '13 at 05:32

score 0 · Answer 1 · edited May 23 '17 at 10:32

0

awk -F "," '
# it's the header
NR == 1 { header=$0 ; OFS=FS ; next }
# check if we stored the code already
! ( $2 in codes ) { codes[$2]=1 }
# store the max and line shift the previous max, skip to next line
$3 >= $2[1] { 
    $2[2] =  $2[1] ; 
    $2[4] = $2[3] ; 
    $2[3] = $0 ; 
    $2[1] = $3 ; 
    next 
}
# store the second greatest value
$3 > $2[2] { 
    $2[2] =  $3 ; 
    $2[4] = $0 
}
# finally print the maxes
END {
print header
for (c in codes) {
    print c[3]
    print c[4]
    }
}' INPUTFILE

Note, this prints the codes in random order, but the value lines (in the codes array) will be sorted (desc.).

UPDATE: the problem with the above code is that GAwk does not support dynamic array names. It can be solved with multi dimensional arrays which GAwk supports in a strange manner, see here: http://objectmix.com/awk/27035-dynamic-array-names.html and Awk array iteration for multi-dimensional arrays and the official docs: http://www.gnu.org/software/gawk/manual/html_node/Multi_002ddimensional.html . I'd recommend going that way.

edited May 23 '17 at 10:32

Community

1
1

answered Dec 09 '13 at 05:32

Zsolt Botykai

50,406
14
85
110

Thanks for your answer, Actually my idea was to be able to get the 10 highest values or more (I have made a really simple exemple to get the idea and then adapt to my problem) (can be text file with 10 column or more with thousands of rows). So your idea seems great but is there a way or a syntax to be able to modify the "X" value to modify the X highest values wanted? – user3081482 Dec 09 '13 at 05:54
First read this: http://stackoverflow.com/help/reputation . You can adapt my code to 10 (max) rows with + ~8 line and adding 2 in the END block with deleting one there. But first show us what have you tried yet? – Zsolt Botykai Dec 09 '13 at 05:57
awk -F "," ' NR == 1 { header=$0 ; OFS=FS ; next } ! ( $2 in codes ) { codes[$2]=1 } $3 >= $2[1] { $2[2] = $2[1] ; $2[4] = $2[3] ; $2[3] = $0 ; $2[1] = $3 ; next } $3 > $2[2] { $2[2] = $3 ; $2[4] = $0 } $3 > $2[3] { $2[2] = $3 ; $2[4] = $0 } $3 > $2[4] { $2[2] = $3 ; $2[4] = $0 } $3 > $2[5] { $2[2] = $3 ; $2[4] = $0 } $3 > $2[6] { $2[2] = $3 ; $2[4] = $0 } $3 > $2[7] { $2[2] = $3 ; $2[4] = $0 } $3 > $2[8] { $2[2] = $3 ; $2[4] = $0 } $3 > $2[9] { $2[2] = $3 ; $2[4] = $0 } END { print header for (c in codes) { print c[3] print c[4] }}' Inpout.txt > output_test.txt – user3081482 Dec 09 '13 at 07:35
First what were your error? Second: I'd recommend pasting your code in e.g. https://gist.github.com/ – Zsolt Botykai Dec 09 '13 at 08:53
Thanks for the advice, i have changed few things in the script https://gist.github.com/anonymous/07df0446eef53f9f0b42 but keep going to have $3 >= $2[2] { [2]: command not found ! ( $2 in Code ) { Code[$2]=1 } bash: syntax error near unexpected token `{' – user3081482 Dec 10 '13 at 06:36
OK, I had recognized the problem. GAwk does not support dynamic array names. It can be solved with multi dimensional arrays which GAwk supports in a strange manner, see here: http://objectmix.com/awk/27035-dynamic-array-names.html and http://stackoverflow.com/questions/3060600/awk-array-iteration-for-multi-dimensional-arrays and the official docs: http://www.gnu.org/software/gawk/manual/html_node/Multi_002ddimensional.html . I'd recommend going that way. – Zsolt Botykai Dec 10 '13 at 12:34

How to find the X highest values in a list depending on the values of another list with bash/awk?

1 Answers1