0

I have a file characterizing genomic regions that looks like this:

chrom   chromStart  chromEnd    PGB
chr1    12874   28371   2
chr1    15765   21765   1
chr1    15795   28371   2
chr1    18759   24759   1
chr1    28370   34961   1
chr3    233278  240325  1
chr3    239279  440831  2
chr3    356365  362365  1

Basically PGB describes the category of the genomic region characterised by its chromosome number (chrom), start (chromStart) and end (chromEnd) coordinates.

I wish to collapse the overlapping regions such that overlapping regions of PGB = 1 and 2 are in a new category, PGB = 3. Output being:

chrom   chromStart  chromEnd    PGB
chr1    12874   15764   2
chr1    15765   24759   3
chr1    24760   28369   2
chr1    28370   28371   3
chr1    28372   34961   1
chr3    233278  239278  1
chr3    239279  240325  3
chr3    240326  356364  2
chr3    356365  440831  3

Basically I wish to obtain an output file which reports unique regions. There are a two criteria.

First, if PGB (column 4) is identical between rows, merge range. eg.

chrom   chromStart  chromEnd    PGB
chr1    1   10  1
chr1    5   15  1

output

chrom   chromStart  chromEnd    PGB
chr1    1   15  1

Second, if PGB is different between rows, chr (column 1) is identical, and the ranges overlap (col2 and 3), report overlapping range as PGB = 3 as well as the ranges unique to their individual categories.

eg.

chrom   chromStart  chromEnd    PGB
chr1    30  100 1
chr1    50  150 2

output

chrom   chromStart  chromEnd    PGB
chr1    30  49  1
chr1    50  100 3
chr1    101 150 2

I hope that illustrates the problem better.

mpapec
  • 50,217
  • 8
  • 67
  • 127
  • Have you tried anything so far? – hmatt1 Mar 07 '14 at 05:36
  • I'm very new to perl/unix so I'm doing it manually on excel. Unfortunately I have 60000+ lines so I'm hoping for a faster alternative. – user3222627 Mar 07 '14 at 05:39
  • @user3222627 You need to explain a bit more on how you are getting your desired result. – jaypal singh Mar 07 '14 at 05:46
  • @jaypal, looks like there are ranges, chromStart to chromEnd. Each range has a number 1 or 2 assigned to it. If a range with a 1 overlaps with a range with a 2, the overlapping part is assigned PGB = 3. You'll see it if you look at the data close enough. It's a tricky transformation; I'm not sure what an efficient way to implement it would be. – hmatt1 Mar 07 '14 at 05:52
  • http://www.perlmonks.org/?node_id=823275 – mpapec Mar 07 '14 at 06:20
  • @user3222627 What happens when the PGB is different and the ranges don't overlap, such as the case in your first two lines of sample input? – jaypal singh Mar 07 '14 at 06:23
  • @jaypal If the PGB is different and their ranges don't overlap, they are reported as is. However the first two lines of my sample input do overlap (second line is completely within the first), which is why it is recategorised as PGB = 3. – user3222627 Mar 07 '14 at 06:53
  • @user3222627 Correct, but couldn't follow how `21765` became `24759`. – jaypal singh Mar 07 '14 at 06:57
  • @jaypal That's because line 4 overlaps line 2 and both have the same PGB value. So the combined region is 24759 under PGB = 1. – user3222627 Mar 07 '14 at 07:03
  • Possibly similar enough: [awk-to-find-overlaps](http://stackoverflow.com/questions/21488613/awk-to-find-overlaps)? – n0741337 Mar 07 '14 at 07:13
  • 1
    Dave Cross wrote an excellent book about solving these kinds of problems. You can get it as a pdf http://man.lupaworld.com/content/develop/Perl/Data_Munging_with_Perl.pdf – Vorsprung Mar 07 '14 at 09:47

1 Answers1

1

I've created a script that I believe accomplishes this goal.

use List::Util qw(max);

use strict;
use warnings;

# Skip Header row
<DATA>;

# With only 60k rows, going to just slurp all the data
my %chroms;
while (<DATA>) {
    chomp;
    my ($chrom, $start, $end, $pgb) = split ' ';

    # Basic Data Validation.
    warn "chromStart is NaN, '$start'" if $start =~ /\D/;
    warn "chromEnd is NaN, '$end'" if $end =~ /\D/;
    warn "range not ordered, '$start' to '$end'" if $start > $end;
    warn "PGB only can be 1 or 2, '$pgb'" if ($pgb ne '1' && $pgb ne '2');

    push @{$chroms{$chrom}{$pgb}}, [$start, $end];
}

print "chrom    chromStart  chromEnd    PGB\n";

# Process each Chrom
for my $chrom (sort keys %chroms) {
    for my $pgb (sort keys %{$chroms{$chrom}}) {
        my $ranges = $chroms{$chrom}{$pgb};

        # Sort Ranges
        @$ranges = sort {$a->[0] <=> $b->[0] || $a->[1] <=> $b->[1]} @$ranges;

        # Combine overlapping and continguous ranges.
        # - Note because we're dealing with integer ranges, 1-4 & 5-8 are contiguous
        for my $i (0..$#$ranges-1) {
            if ($ranges->[$i][1] >= $ranges->[$i+1][0] - 1) {
                $ranges->[$i+1][0] = $ranges->[$i][0];
                $ranges->[$i+1][1] = max($ranges->[$i][1], $ranges->[$i+1][1]);
                $ranges->[$i] = undef;
            }
        }
        @$ranges = grep {$_} @$ranges;
    }

    # Create pgb=3 for overlaps.
    # - Save old ranges into aliases, and then start fresh
    my $pgb1array = $chroms{$chrom}{1};
    my $pgb2array = $chroms{$chrom}{2};
    my @ranges;

    # Always working on the first range in each array, until one of the arrays is empty
    while (@$pgb1array && @$pgb2array) {
        # Aliases to first element
        my $pgb1 = $pgb1array->[0];
        my $pgb2 = $pgb2array->[0];

        # PGB1 < PGB2
        if ($pgb1->[1] < $pgb2->[0]) {
            push @ranges, [@{shift @$pgb1array}, 1]

        # PGB2 < PGB1
        } elsif ($pgb2->[1] < $pgb1->[0]) {
            push @ranges, [@{shift @$pgb2array}, 2]

        # There's overlap for all rest 
        } else {
            # PGB1start < PGB2start
            if ($pgb1->[0] < $pgb2->[0]) {
                push @ranges, [$pgb1->[0], $pgb2->[0]-1, 1];
                $pgb1->[0] = $pgb2->[0];

            # PGB2start < PGB1start
            } elsif ($pgb2->[0] < $pgb1->[0]) {
                push @ranges, [$pgb2->[0], $pgb1->[0]-1, 2];
                $pgb2->[0] = $pgb1->[0];
            }
            # (Starts are equal now)

            # PGB1end < PGB2end
            if ($pgb1->[1] < $pgb2->[1]) {
                $pgb2->[0] = $pgb1->[1] + 1;
                push @ranges, [@{shift @$pgb1array}, 3];

            # PGB2end < PGB1end
            } elsif ($pgb2->[1] < $pgb1->[1]) {
                $pgb1->[0] = $pgb2->[1] + 1;
                push @ranges, [@{shift @$pgb2array}, 3];

            # PGB2end = PGB1end
            } else {
                push @ranges, [@$pgb1, 3];
                shift @$pgb1array;
                shift @$pgb2array;
            }
        }
    }

    # Append whichever is left over
    push @ranges, map {$_->[2] = 1; $_} @$pgb1array;
    push @ranges, map {$_->[2] = 2; $_} @$pgb2array;

    printf "%-8s %-11s %-11s %s\n", $chrom, @$_ for (@ranges);
}

1;

__DATA__
chrom   chromStart  chromEnd    PGB
chr1    12871   12873   2
chr1    12874   28371   2
chr1    15765   21765   1
chr1    15795   28371   2
chr1    18759   24759   1
chr1    28370   34961   1
chr3    233278  240325  1
chr3    239279  440831  2
chr3    356365  362365  1

And the result:

chrom    chromStart  chromEnd    PGB
chr1     12871       15764       2
chr1     15765       24759       3
chr1     24760       28369       2
chr1     28370       28371       3
chr1     28372       34961       1
chr3     233278      239278      1
chr3     239279      240325      3
chr3     240326      356364      2
chr3     356365      362365      3
chr3     362366      440831      2

Note: I created this mostly as an exercise for myself, but if you use it in any way, please let me know.

Miller
  • 34,962
  • 4
  • 39
  • 60
  • Added minor fix for contiguous ranges. 1-4 & 5-8 should be joined since we're only dealing with integer ranges. – Miller Mar 10 '14 at 01:27