If first two columns are equal, select top 3 based on descending order of 3rd column

Question

I want to select top 3 results for every line that has the same first two column.

For example the data will look like,

cat data.txt
A    A    10
A    A    1
A    A    2
A    A    5
A    A    8
A    B    1
A    B    2
A    C    6
A    C    5
A    C    10
A    C    1
B    A    1
B    A    1
B    A    2
B    A    8

And for the result I want

A    A    10
A    A    8
A    A    5
A    B    2
A    B    1
A    C    10
A    C    6
A    C    5
B    A    1
B    A    1
B    A    2

Note that some of the "groups" do not contain 3 rows.

I have tried

sort -k1,1 -k2,2 -k3,3nr data.txt | sort -u -k1,1 -k2,2 > 1.txt 
comm -23 <(sort data.txt) <(sort 1.txt)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 2.txt 
comm -23 <(sort data.txt) <(cat 1.txt 2.txt | sort)| sort -k1,1 -k2,2 -k3,3nr| sort -u -k1,1 -k2,2 > 3.txt

It seems like it's working but since I am learning to code better was wondering if there was a better way to go about this. Plus, my code will generate many files that I will have to delete.

Just add `rm` statements to your script. – Mad Physicist Jul 18 '17 at 18:04 — Mad Physicist, Jul 18 '17 at 18:04
Shouldn't the top 3 for `B A` be 8, 2, 1 and not 1, 1, 2 ? – janos Jul 18 '17 at 18:12 — janos, Jul 18 '17 at 18:12
@janos my mistake. you are right. – palansuya Jul 18 '17 at 21:48 — palansuya, Jul 18 '17 at 21:48

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

You can do:

$ sort -k1,1 -k2,2 -k3,3nr file | awk 'a[$1,$2]++<3'
A    A    10
A    A    8
A    A    5
A    B    2
A    B    1
A    C    10
A    C    6
A    C    5
B    A    8
B    A    2
B    A    1

Explanation:

There are two key items to understand the awk program; associative arrays and fields.

If you reference an empty awk array element, it is an empty container -- ready for anything you put into it. You can use that as a counter.

You state If first two columns are equal...

The sort puts the file in order desired. The statement a[$1,$2] uses the values of the first two fields as a unique entry into an associative array.

You then state ...select top 3 based on descending order of 3rd column...

Once again, the sort put the file into the desired order, and the statement a[$1,$2]++ counts them. Now just count up to three.

awk is organized into blocks of condition {action} The statement a[$1,$2]++<3 is true until there are more than 3 of the same pattern seen.

A wordier version of the program would be:

awk 'a[$1,$2]++<3 {print $0}'

But the default action if the condition is true is to print $0 so it is not needed.

If you are processing text in Unix, you should get to know awk. It is the most powerful tool that POSIX guarantees you will have, and is commonly used for these tasks.

Great place to start is the online book Effective AWK Programming by Arnold D. Robbins

Very cool (+1), but instead of `a[$1,$2]` wouldn't it be safer to write `a[$1 FS $2]` ? — janos, Jul 18 '17 at 18:49
@janos, not really: the `,` in the array index is actually the [awk `SUBSEP` variable](https://www.gnu.org/software/gawk/manual/html_node/User_002dmodified.html#index-SUBSEP-variable), so it's actually `a[$1 SUBSEP $2]`. The default value for that variable is `\034`, probably unlikely to be found in a file you're processing with awk. — glenn jackman, Jul 18 '17 at 18:56
@glennjackman I agree it's *unlikely*. But `FS` is *impossible* to find in any of the fields. — janos, Jul 18 '17 at 19:02
true,true. How about `BEGIN {SUBSEP=FS}` and then we can still use `a[$1,$2]` — glenn jackman, Jul 18 '17 at 19:13

score 1 · Answer 2 · answered Jul 18 '17 at 19:00

1

@Dawg has the best answer. This one will be a little lighter on memory, which probably won't be a concern for your data:

sort -k1,2 -k3,3nr file | 
awk '
    {key = $1 FS $2} 
    prev != key {prev = key; count = 1} 
    count <= 3 {print; count++}
'

answered Jul 18 '17 at 19:00

glenn jackman

238,783
38
220
352

score 0 · Answer 3 · answered Jul 18 '17 at 18:18

You can sort the file by first two columns primarily and by the 3rd one numerically secondarily, then read the output and only print the first three lines for each combination of the first two columns.

sort -k1,2 -k3,3rn data.txt \
| while read c1 c2 n ; do
    if [[ $c1 == $l1 && $c2 == $l2 ]] ; then
        ((c++))
    else
        c=0
    fi
    if (( c < 3 )) ; then
        echo $c1 $c2 $n
        l1=$c1
        l2=$c2
    fi
done

If first two columns are equal, select top 3 based on descending order of 3rd column

3 Answers3

Explanation: