I have a text file as follows:

    ID      Name  position_start position_end
    ID01    P889      290       298
    ID01    P889      290       299
    ID02    O991      400       405
    ID02    O991      355       373
    ID02    O991      403       404
    ID05    Q151      428       429
    ID05    Q151      428       428
    ID05    Q151      24        24
    ID05    Q151      14        25

For each ID, I would like to extract the longest ranges of start and end positions, dropping any range that overlaps or lies within a longer one. My desired output is shown below.

    ID      Name  position_start position_end
    ID01    P889      290       299
    ID02    O991      400       405
    ID02    O991      355       373
    ID05    Q151      428       429
    ID05    Q151      14        25
  • What do you mean by longest starting and ending positions? – Pratik Singhal Jan 18 '14 at 05:23
  • For example, ID01 has two positions 290 to 298 and 290 to 299. I need the positions from 290 to 299. Hope you can understand! – user3209035 Jan 18 '14 at 05:35
  • Why are there two outputs for ID02/O991? And why two for ID05/Q151? The lengths aren't even the same, so it is not a question of dealing with ties. Will the data be presented in sorted order? The chances are it doesn't matter if you're using `awk` as in the tags. – Jonathan Leffler Jan 18 '14 at 05:57
  • ID02 and ID05 have two outputs because their number ranges are different. In ID02, 403 to 404 isn't included in the output because it is already within the range from 400 to 405. In the same way, 24 to 24 in ID05 isn't included because it is already within the range from 14 to 25. – user3209035 Jan 18 '14 at 06:36

1 Answer

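    # Sort by ID, then Name, then numerically by start and end position.
    sort -k1,1 -k2,2 -k3,3n -k4,4n file > temp

    awk 'NR==1{print;next}                      # header line: print as-is
    NR==2{start=$3;end=$4;id=$1 OFS $2;next}    # first data row opens a range
    { if ($1 OFS $2 == id && $3<=end)           # same ID/Name, start inside current range
          {end=end>$4?end:$4;next}              # keep the larger end position
       print id,start,end;start=$3;end=$4;id=$1 OFS $2   # flush, then open a new range
    }END{print id,start,end}' OFS="\t" temp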

Output:

    ID     Name position_start position_end
    ID01    P889    290     299
    ID02    O991    355     373
    ID02    O991    400     405
    ID05    Q151    14      25
    ID05    Q151    428     429
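
If it helps, here is a minimal sketch of the same sort-then-merge idea written as a single pipeline. It is only an illustration, not a replacement for the commands above: the file name `positions.txt` and the function name `merge_ranges` are made up for this example, and the sample data is copied from the question. The sketch keeps the header out of `sort` and does not need a temp file.

```shell
# Hypothetical sample file, copied from the question.
cat > positions.txt <<'EOF'
ID     Name position_start position_end
ID01    P889      290       298
ID01    P889      290       299
ID02    O991      400       405
ID02    O991      355       373
ID02    O991      403       404
ID05    Q151      428       429
ID05    Q151      428       428
ID05    Q151      24        24
ID05    Q151      14        25
EOF

# merge_ranges FILE (hypothetical helper): print the header unchanged,
# then merge overlapping ranges per ID/Name pair.
merge_ranges() {
    head -n 1 "$1"
    tail -n +2 "$1" |
    sort -k1,1 -k2,2 -k3,3n -k4,4n |
    awk -v OFS='\t' '
        {
            key = $1 OFS $2
            # Same ID/Name and start inside the current range: extend it.
            if (NR > 1 && key == id && $3 <= end) {
                if ($4 > end) end = $4
                next
            }
            # Otherwise flush the finished range and open a new one.
            if (NR > 1) print id, start, end
            id = key; start = $3; end = $4
        }
        # Flush the last open range, if there was any data at all.
        END { if (NR > 0) print id, start, end }
    '
}

result=$(merge_ranges positions.txt)
printf '%s\n' "$result"
```

Because the rows are sorted by start position within each ID, a row can only overlap the range that is currently open, so one pass with a single remembered `start`/`end` pair is enough.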
BMW