sort and uniquize a tab separated file for two columns

Question

I have a very large tab-separated file, part of which looks like this:

33  x   171 297 126
4   x   171 300 129
2   x   171 303 132
11  y   163 289 126
5   y   163 290 127
3   y   163 291 128
2   y   163 292 129
2   y   170 289 119
2   z   166 307 141
2   z   166 308 142
6   z   166 309 143
4   z   166 329 163
2   z   166 330 164

I want to sort the file and keep only one line for each of x, y, and z: the line with the highest value in the first column (in Unix).

So you would expect `33 x ...`, `11 y ...`, and `6 z ....` ? — Hunter McMillen, May 02 '17 at 18:29
Try this: `perl -lanE '($v,$k)=@F[0..1];$h{$k}=$_,$j{$k}=$v if $j{$k}<$v;END{say for values %h}' file` — Håkon Hægland, May 02 '17 at 19:09
Please show the expected output for the input you use as illustration. — Jonathan Leffler, May 02 '17 at 19:53
Use the Linux `sort` tool to sort by column 2 and column 1 in descending order. That will give you a list with all the x's together with the highest number on top, all the y's together with the highest on top, etc. Then `uniq --skip-fields=1` will take the first line of each group, giving you the output you desire. — Jim Mischel, May 03 '17 at 12:36
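The sort-based idea in the last comment can be sketched as follows. One caveat: `uniq --skip-fields=1` compares everything after the first field, so lines whose trailing columns differ would not collapse into one. This sketch therefore swaps in `awk '!seen[$2]++'` to keep the first line per column-2 key. The temporary file and its sample rows are assumptions made for illustration.

```shell
# Hypothetical sample data (a subset of the rows in the question), tab-separated.
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  33 x 171 297 126 \
  4  x 171 300 129 \
  11 y 163 289 126 \
  2  y 170 289 119 \
  2  z 166 307 141 \
  6  z 166 309 143 > "$tmp"

tab=$(printf '\t')

# Sort by column 2, then by column 1 numerically in descending order,
# and keep only the first line seen for each column-2 key.
result=$(sort -t "$tab" -k2,2 -k1,1nr "$tmp" | awk -F'\t' '!seen[$2]++')
printf '%s\n' "$result"

rm -f "$tmp"
```

Sorting costs O(n log n) and may spill to disk for a very large file, whereas a single awk pass only keeps one line per key in memory.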

score 1 · Answer 1 · answered May 02 '17 at 18:46

You can do this with awk:

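awk '
{
  key = $2;                                          # group by column 2
  flag = 0;                                          # 1 once this key has been seen
  if (key in value) { max = value[key]; flag = 1 }   # current maximum for the key
  if (flag == 0 || max < $1) { value[key] = $1; line[key] = $0 }   # new key or larger value
}
END {
  for (key in line) { print line[key] }              # output order is unspecified
}
' data.tsv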

Andrey

You don't need the flag and max. Remove lines 2 and 3 in the first block, and change the `if` in line 4 to `if (value[key] < $1)`. – ULick May 02 '17 at 21:49
@ULick if the first column contains negative numbers, your version might not work properly (default values are 0 and ""). – Andrey May 02 '17 at 23:10
True. That can be solved with `if ( ! (key in value) || $1 > value[key] )`. Still no flag, but at the cost of some readability. – ULick May 03 '17 at 17:06
This works perfectly: `perl -lanE '($v,$k)=@F[0..1];$h{$k}=$_,$j{$k}=$v if $j{$k}<$v;END{say for values %h}' file` – kaur May 03 '17 at 19:52
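For reference, the flag-free condition discussed in the comments above can be sketched as below, tested on negative values, which is the case Andrey raised. The array name `best`, the sample rows, and the temporary file are assumptions made for illustration. Adding `+ 0` forces a numeric comparison, and the final `sort` only makes the output order deterministic.

```shell
# Hypothetical sample data with negative values in column 1, tab-separated.
tmp=$(mktemp)
printf '%s\t%s\t%s\n' \
  -5 x 1 \
  -2 x 2 \
  -9 y 3 \
  -7 y 4 > "$tmp"

tab=$(printf '\t')

# Keep a line when its key is new or when column 1 exceeds the stored maximum.
result=$(awk -F'\t' '
  !($2 in best) || $1 + 0 > best[$2] { best[$2] = $1 + 0; line[$2] = $0 }
  END { for (k in line) print line[k] }
' "$tmp" | sort -t "$tab" -k2,2)
printf '%s\n' "$result"

rm -f "$tmp"
```

Testing `$2 in best` first means an unseen key is never compared against awk's default empty value, so the result is correct even when every value for a key is negative.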
