I have a large file with about 50 columns and 100K rows, delimited by |. A given column 2 ($2) value can be paired with several different column 1 ($1) values, so the same $2 value is repeated across rows. I have sorted the file, and I now need to filter it according to the conditions below:

  • condition 1: when a given $2 has both kinds of $1 value (some above 8000 and some below 8000), select the complete row where $1 < 8000 for that $2
  • condition 2: if a given $2 only has $1 values above 8000, select the complete row with the maximum $8 value

E.g: source file

4000|1234||||||23
5000|1234||||||22
9000|1234||||||25
10000|123|||||||22
22000|456|||||||27
15000|456|||||||29

result file would have:

9000|1234||||||25
10000|123|||||||23
15000|456|||||||29

Can anyone please advise on this? Thanks in advance.

appi
  • Why is `22000` not in output? – anubhava May 27 '16 at 11:27
  • Why is 22000 not in output? Answer: because $2 (456) only has $1 values above 8000, so the row is selected on the basis of the maximum $9 value (29 > 27). – appi May 27 '16 at 11:31
  • It is unclear, why is `123` row not showing `22` in last column? – anubhava May 27 '16 at 11:34
  • What is the meaning of " when `$2` has both type of `$1`"? – Michael Vehrs May 27 '16 at 12:04
  • The wording is unclear. `Now $2(col 2) has multiple type of $1(col 1) value which means col 2 will be repeated. ` -- what does `multiple type` mean? – agc May 27 '16 at 12:06
  • agc: as you can see, column 2 ($2) has a one-to-many relation with column 1 ($1). That is what I mean. Please let me know whether it is clear now. – appi May 29 '16 at 04:44

1 Answer

Here is the answer I arrived at:

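 sort -t'|' -k2,2n -k1,1n sortexp.txt |
 awk -F'|' '
   # $1 < 8000: print the row and flag its $2
   $1 < 8000  { a[$2]++; print }
   # $1 >= 8000: keep the max-$8 row per unflagged $2
   $1 >= 8000 { if (!a[$2] && (!($2 in e) || e[$2] < $8)) { u[$2] = $0; e[$2] = $8 } }
   END { for (i in u) print u[i] }'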
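
The command above relies on the sort so that, for each $2, the rows with $1 < 8000 are seen before the rows with $1 >= 8000. A variant that does not depend on input order might look like the sketch below. It is only an illustration under some assumptions: the function name `filter_rows` is made up, the value to maximise is taken from $8 as stated in condition 2, and every row with $1 < 8000 is kept when a $2 has both kinds of $1.

```shell
# filter_rows is a hypothetical helper name; $1 is the path of a |-delimited file.
filter_rows() {
  awk -F'|' '
    # Condition 1 candidates: collect every row whose first field is below 8000.
    $1 < 8000 {
      if ($2 in low) low[$2] = low[$2] ORS $0
      else           low[$2] = $0
      next
    }
    # Condition 2 candidates: remember the row with the largest $8 per $2.
    !($2 in best) || $8 + 0 > max[$2] { best[$2] = $0; max[$2] = $8 + 0 }
    END {
      # Keys that have low rows print only those rows.
      for (k in low) print low[k]
      # Keys without any low row print their max-$8 row.
      for (k in best) if (!(k in low)) print best[k]
    }
  ' "$1"
}
```

It could be called as `filter_rows sortexp.txt > result.txt`. The order of the output lines is not guaranteed, because awk's `for (k in array)` iterates in an unspecified order, so pipe the result through `sort` if the order matters.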
appi