0

I have a table with five columns. I want to merge start and end columns if they overlap, and have same RNAiclone and target_mRNA name. If the start-end of two entries are: (A) 1-10, 11-20 means overlapping range; while (B)1-10, 12-20 means no-overlapping range. RNAilength(nt) is same for similar RNAiclone.

input.txt

RNAiclone   RNAilength(nt)  target_mRNA  start   end
siRNA1      10              mRNA1         1      10
siRNA1      10              mRNA1         11     20
siRNA1      10              mRNA1         17     30
siRNA1      10              mRNA2         18     19
siRNA2      20              mRNA2         1      10
siRNA2      20              mRNA2         9      100

expected output.txt

RNAiclone   RNAilength(nt)  target_mRNA   start   end
siRNA1      10              mRNA1         1       30
siRNA1      10              mRNA2         18      19
siRNA2      20              mRNA2         1       100

program.awk

BEGIN{
  i=0;
  s="";
  m="";
  OFS="\t";
}
{
  if (s!=$1 && m!=$3){
    if (s != "" && m!= ""){
      combine(chr,s,m,i);
    }
    i=0;
    s="";
  }
  s=$1;
  m=$3;
  chr[i,0]=$4;
  chr[i,1]=$5;
  i++
}
END{
  combine(chr,s,m,i);
}
function combine(arr,s,m,i) {
  j=0;
  new[j,0]=arr[0,0];
  new[j,1]=arr[0,1];
  for (k=1;k<i;k++)
    {
      if ((arr[k,0]<=new[j,1])&&(arr[k,1]>=new[j,1])){
    new[j,1]=arr[k,1];
      }
      else if (arr[k,0]>new[j,1]){
    j++;
    new[j,0]=arr[k,0];
    new[j,1]=arr[k,1];
      }
    }
  for (n=0;n<=j;n++){
    print s,m,new[n,0],new[n,1]
  }
}

I am running the script using command "wk -f program.awk input.txt > output.txt", but I am not getting the expected result. Could you kindly help me to correct the script. Thank you very much.

Miller
  • 34,962
  • 4
  • 39
  • 60
firoz
  • 91
  • 7
  • Someone had a similar problem 3 months ago, [`extract overlapping regions`](http://stackoverflow.com/q/22242098/1733163). Yours is much simpler though. Will chime in later tonight if noone else has. – Miller Jun 05 '14 at 02:10

2 Answers2

0

It's easier for me to try a reboot than debug your code. Here's an alternate in awk. Put the following into an executable awk file:

#!/usr/bin/awk -f

BEGIN { OFS="\t" }

/^RNA/ { print; next }

{
  key = $1 OFS $2 OFS $3

  # new key or new range, print last line
  if( key != l_key || l_end+1 < $4 ) {
    if( FNR>1 && l_key ) { print l_key, s[l_key], l_end }
    s[key]=$4
  }

  l_key=key
  l_end=$5
}

END { print l_key, s[l_key], l_end } # print final range

Calling that file awko (and chmod +x awko), then running like awko input.txt gives the following:

RNAiclone   RNAilength(nt)  target_mRNA  start   end
siRNA1  10  mRNA1   1   30
siRNA1  10  mRNA2   18  19
siRNA2  20  mRNA2   1   100

which can be re-aligned with awko input.txt | column -t:

RNAiclone  RNAilength(nt)  target_mRNA  start  end
siRNA1     10              mRNA1        1      30
siRNA1     10              mRNA2        18     19
siRNA2     20              mRNA2        1      100

so the final command would be

awko data | column -t > output.txt

This has at least the following assumptions:

  • data is sorted by columns 1-4 as in the example input.txt
  • the ranges behave as if end always >= start (I didn't test anything else)
n0741337
  • 2,474
  • 2
  • 15
  • 15
0

You've already gotten your awk answer, but here is a version that does the same in perl:

use strict;
use warnings;

my $last;
while (<>) {
    my @cols = split;
    my $key = join ' ', splice @cols, 0, 3;
    my ($start, $end) = @cols;
    if ($last) {
        if ($last->[0] ne $key || $last->[2] < $start - 1) {
            print "@$last\n"
        } else {
            $start = $last->[1];
            $end   = $last->[2] if $end < $last->[2];
        }
    }
    $last = [$key, $start, $end];
}
print "@$last\n";

Executing this program perl merge_range.pl file.dat gives the following results:

RNAiclone RNAilength(nt) target_mRNA start end
siRNA1 10 mRNA1 1 30
siRNA1 10 mRNA2 18 19
siRNA2 20 mRNA2 1 100

Assumes data is sorted like in your example data. To format the columns, just pipe through column -t.

Miller
  • 34,962
  • 4
  • 39
  • 60