3

I have a file like this example:

NDUFAF7,0.216216,
ESRRA,0.0178571,
HS3ST1,0.027027,
HS3ST1,0,
HS3ST1,0.0833333,
ESRRA,0.214286,
NDUFAF7,0.0824742,
ESRRA,0.0810811,
NDUFAF7,0,

in which there are 2 comma separated columns. in the 1st column some rows are repeated. I want to keep only one of every repeated row based on the value in the 2nd column. in fact I want to keep the one with the biggest value in the 2nd row. the output for above example would be (which is tab separated):

NDUFAF7 0.216216
HS3ST1  0.0833333
ESRRA   0.214286

I tried the following code in awk but did not return what I want.

awk -F "," '{ if($2 >= $2) { print }}' file_name

do you know how to fix it?

Inian
  • 80,270
  • 14
  • 142
  • 161
  • 2
    Possible duplicate of [Using awk to get the maximum value of a column, for each unique value of another column](https://stackoverflow.com/questions/35069048/using-awk-to-get-the-maximum-value-of-a-column-for-each-unique-value-of-another) – RomanPerekhrest Mar 27 '18 at 07:12
  • @Allan: Why is this being tagged under `bash`, if the OP doesn't tag it intentionally? – Inian Mar 27 '18 at 07:27
  • Do you accept only `awk` solutions? Could you have a look at my answer? – Allan Mar 27 '18 at 07:38

2 Answers2

1

Solution 1st: Following awk may help you on same. It will NOT give the order of field first in same order of Input_file.

awk -F"," '{a[$1]=a[$1]>$(NF-1)?a[$1]:$(NF-1)} END{for(i in a){print i,a[i]}}' Input_file

OR a non-one liner form of above solution too here.

awk -F"," '
{
  a[$1]=a[$1]>$(NF-1)?a[$1]:$(NF-1)
}
END{
  for(i in a){ print i,a[i] }
}
'   Input_file

Solution 2nd: In case you need same sequence of 1st field same as Input_file then following may help you on same.

awk -F"," '!b[$1]++{c[++i]=$1} {a[$1]=a[$1]>$(NF-1)?a[$1]:$(NF-1)} END{for(j=1;j<=i;j++){print c[j],a[c[j]]}}'  Input_file

OR adding a non-one liner form of solution too now.

awk -F"," '
!b[$1]++{ c[++i]=$1 }
{
a[$1]=a[$1]>$(NF-1)?a[$1]:$(NF-1)
}
END{
for(j=1;j<=i;j++){
  print c[j],a[c[j]]}
}
'   Input_file

EDIT:

awk -F"," '!b[$1]++{c[++i]=$1} {a[$1]=a[$1]>$(NF-1)?a[$1]:$(NF-1)} END{for(j=1;j<=i;j++){print c[j],a[c[j]]}}' Input_file
NDUFAF7 0.216216
ESRRA 0.214286
HS3ST1 0.0833333
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
  • when I run it, in the output the second column would be exactly the same as 1st column. both are the names and there is no values. do you know what the problem is? –  Mar 27 '18 at 07:20
  • @user7249622, see my EDIT, I am getting proper results, check and let me know on same then. – RavinderSingh13 Mar 27 '18 at 07:23
  • @user7249622, also I saw your previous questions too, please always give feed back to people and let ALL know what answer is working for you by selecting a CORRECT ANSWER option for any post, you could up-vote to encourage people too. – RavinderSingh13 Mar 27 '18 at 07:27
1

You can use the following command:

$ sort -t',' -k1,2 -nr file | awk 'BEGIN{FS=",";OFS="\t"}{if(save!=$1){print $1,$2}save=$1}'                                                   
NDUFAF7 0.216216
HS3ST1  0.0833333
ESRRA   0.214286

Explanations:

the sort will sort your files first alphabetically on the name of the first column then on the 2nd column numerically in descending order. The awk command will get the first line of each cluster (the line with the maximum and adapt the output)

You can also use datamash:

$ datamash -t',' --sort --group 1 max 2  < file | tr ',' '\t'
ESRRA   0.214286
HS3ST1  0.0833333
NDUFAF7 0.216216

and sort the output depending on your needs.

Allan
  • 12,117
  • 3
  • 27
  • 51