Perl sort genomic positions

Question

I have a list of genomic positions in the format chromosome:start-end, for example:

chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200

I want to sort this by chromosome number and numerical start position to get this:

chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-200
chrX:100-200

What is a good and efficient way to do this in perl?

Is `X` bigger or smaller than `1X`? In general, how are chromosome "numbers" compared? – Borodin, Jul 18 '14 at 14:37
chromosomes should be sorted karyotypically: chr1, chr2 ... chr22, chrX, chrY – gwo, Jul 18 '14 at 14:45
to rephrase my question: I am looking for some perl code that does something equivalent to the unix version sort: `sort -k1,1V` but inside perl on a list. – gwo, Jul 18 '14 at 14:47
@gwo You could always shell out to use unix sort if you wanted. – Hunter McMillen, Jul 18 '14 at 15:00
For those interested, what `sort -V` does is encapsulated in [`filevercmp.c`](https://github.com/ekg/filevercmp/blob/master/filevercmp.c). The best Perl equivalent that I know is the [`Sort::Naturally`](https://metacpan.org/module/Sort::Naturally) module – Borodin, Jul 18 '14 at 15:39
@Borodin A better natural sort module is: [`Sort::Key::Natural`](https://metacpan.org/pod/Sort::Key::Natural). There are flaws in the implementation of nsort. It strips all non-word characters from strings and then only looks for alternation between alpha and numeric. For example `perl -MSort::Naturally -E 'say for nsort qw(1-000 2-0 3-0)'` will display as 2,3,1, whereas `perl -MSort::Key::Natural=natsort -E 'say for natsort qw(1-000 2-0 3-0)'` will DWIM. I haven't decided whether it's worth contacting the author of S::N about it yet, but I'm going to start recommending the alternate module. – Miller, Aug 11 '14 at 23:33

Matthew Franglen · Answer 1 · 2014-07-18T15:26:01.977

You can sort this by providing a custom comparator. It appears that you want a two level value as the sorting key, so your custom comparator would derive the key for a row and then compare that:

# You want karyotypical sorting on the first element,
# so set up this hash with an appropriate normalized value
# per available input:

my %karyotypical_sort = (
    1 => 1,
    ...
    X => 100,
);

sub row_to_sortable {
    my $row = shift;
    $row =~ /chr(.+):(\d+)-/; # assuming match here! Be careful
    return [$karyotypical_sort{$1}, $2];
}

sub sortable_compare {
    my ($one, $two) = @_;

    return $one->[0] <=> $two->[0] || $one->[1] <=> $two->[1];
    # If first comparison returns 0 then try the second
}

@lines = ...

print join "\n", sort {
    sortable_compare(row_to_sortable($a), row_to_sortable($b))
} @lines;

String manipulation is not free, and the comparator above re-parses both rows on every single comparison. Since you are probably dealing with a lot of data (genomes!), you will likely see better performance from a Schwartzian Transform: precalculate the sort key for each row once, sort on the precalculated keys, and finally strip the keys off again:

my @st_lines = map { [ row_to_sortable($_), $_ ] } @lines;
my @sorted_st_lines = sort { sortable_compare($a->[0], $b->[0]) } @st_lines;
my @sorted_lines = map { $_->[1] } @sorted_st_lines;

Or combined:

print join "\n",
    map { $_->[1] }
    sort { sortable_compare($a->[0], $b->[0]) }
    map { [ row_to_sortable($_), $_ ] } @lines;
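
For reference, here is one way the pieces above could be assembled into a self-contained script. It is only a sketch: the rank table is my own assumption (autosomes 1 to 22 keep their number, X and Y get 23 and 24), since the hash above leaves the values elided, and the names `%karyotype_rank`, `position_key` and `sort_positions` are made up for illustration. It also uses the end position as a final tie-breaker, which the code above does not do, and it needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Hypothetical rank table: autosomes 1..22 keep their number,
# X and Y follow them. Extend it for other contigs as needed.
my %karyotype_rank = map { $_ => $_ } 1 .. 22;
@karyotype_rank{qw(X Y)} = (23, 24);

# Parse "chrN:start-end" into [rank, start, end]; die on bad input
# rather than silently sorting on undefined keys.
sub position_key {
    my ($position) = @_;
    my ($chr, $start, $end) = $position =~ /^chr(\w+):(\d+)-(\d+)$/
        or die "Unparseable position: $position\n";
    my $rank = $karyotype_rank{$chr}
        // die "Unknown chromosome: $chr\n";
    return [$rank, $start, $end];
}

# Schwartzian transform: compute each key once, sort on the keys,
# then strip the keys off again.
sub sort_positions {
    return map  { $_->[1] }
           sort {
                  $a->[0][0] <=> $b->[0][0]
               || $a->[0][1] <=> $b->[0][1]
               || $a->[0][2] <=> $b->[0][2]
           }
           map  { [ position_key($_), $_ ] } @_;
}

my @positions = qw(
    chr1:100-110 chr1:1000-1100 chr1:200-300
    chr10:100-200 chr2:100-200 chrX:100-200
);
print "$_\n" for sort_positions(@positions);
```

With the six positions from the question this prints them in the order the question asks for. Dying on an unknown chromosome is a design choice; you may prefer to give unknown names a large default rank instead.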

score 1 · Answer 2 · edited Aug 10 '14 at 06:30

It looks to me like you want to sort in order of the following:

By Chromosome Number
Then by the Start Position
Then (maybe) by the End Position.

So, perhaps a custom sort like this:

use strict;
use warnings;

print sort {
    my @a = split /chr|:|-/, $a;
    my @b = split /chr|:|-/, $b;
    "$a[1]$b[1]" !~ /\D/ ? $a[1] <=> $b[1] : $a[1] cmp $b[1]
      or $a[2] <=> $b[2]
      or $a[3] <=> $b[3]
} <DATA>;

__DATA__
chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200
chrY:100-200
chrX:1-100
chr10:100-150

Outputs:

chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-150
chr10:100-200
chrX:1-100
chrX:100-200
chrY:100-200

edited Aug 10 '14 at 06:30 by Miller
answered Jul 18 '14 at 15:15 by Randall

Doesn't this throw a warning when you compare chrX? Assuming you are using `use warnings;` – Hunter McMillen Jul 18 '14 at 15:20
`split(/:-/, $a)` does not split on : or -, instead it looks for :- together. I think you wanted [:-] – Matthew Franglen Jul 18 '14 at 15:37
You're right @HunterMcMillen. I've corrected the code. That'll teach me not to skip 'use warnings' even for a quick proof of concept! – Randall Jul 18 '14 at 18:13
Thanks, @MatthewFranglen - that goof became apparent as soon as I put in 'use warnings' :-) – Randall Jul 18 '14 at 18:13

score 1 · Accepted Answer · answered Aug 11 '14 at 21:43

Just use the module Sort::Key::Natural:

use strict;
use warnings;

use Sort::Key::Natural qw(natsort);

print natsort <DATA>;

__DATA__
chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200
chrY:100-200
chrX:1-100
chr10:100-150

Outputs:

chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-150
chr10:100-200
chrX:1-100
chrX:100-200
chrY:100-200
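
If installing a CPAN module is not an option, a rough core-only approximation is to zero-pad every run of digits and then compare the padded strings. To be clear, this is a sketch of a common trick and not how `Sort::Key::Natural` is implemented; the names `padded_key` and `natural_sort` and the padding width of 12 are my own assumptions.

```perl
use strict;
use warnings;

# Build a string key in which every run of digits is zero-padded to a
# fixed width, so plain string comparison orders the numbers correctly.
# The width of 12 is an arbitrary assumption; it must be at least as
# long as the longest digit run in the data.
sub padded_key {
    my ($text) = @_;
    (my $key = $text) =~ s/(\d+)/sprintf '%012d', $1/ge;
    return $key;
}

# Compute each key once, sort on it, then recover the original strings.
sub natural_sort {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ padded_key($_), $_ ] } @_;
}

print "$_\n" for natural_sort(qw(
    chr10:100-200 chrX:100-200 chr1:1000-1100 chr2:100-200 chr1:200-300
));
```

Numbered chromosomes end up before chrX and chrY only because digits sort before letters in ASCII. A name such as chrM would land before chrX, so this is not a true karyotypic order.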

score 0 · Answer 4 · answered Jul 18 '14 at 15:15

You could do something like the following script, which reads a text file containing your input above. The sort on the chromosome name would need to change a bit, because the order you want is neither purely lexical nor purely numerical, but I'm sure you could tweak what I have below:

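use strict;
use warnings;

my %chromosomes;

while (my $line = <>) {
    # Capture the chromosome name and the start position
    if ($line =~ /^chr(\w+):(\d+)-\d+$/) {
        my ($chr_num, $chr_start) = ($1, $2);
        # Keep every line per start position so duplicates are not overwritten
        push @{ $chromosomes{$chr_num}{$chr_start} }, $line;
    }
}

# Note: this sorts chromosome names as strings, so chr10 comes before chr2
my @chr_nums = sort(keys(%chromosomes));
foreach my $chr_num (@chr_nums) {
    my @chr_starts = sort { $a <=> $b } keys %{ $chromosomes{$chr_num} };
    foreach my $chr_start (@chr_starts) {
        print @{ $chromosomes{$chr_num}{$chr_start} };
    }
}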
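
As a possible tweak for the chromosome ordering mentioned above, a comparator along these lines could be used. It is a minimal sketch under my own assumptions: the name `compare_chromosomes` is made up, purely numeric names are compared as numbers and placed first, and anything else (X, Y, ...) is compared as a string.

```perl
use strict;
use warnings;

# Order chromosome names: numeric names first, in numeric order,
# then non-numeric names (X, Y, ...) in string order.
sub compare_chromosomes {
    my ($left, $right) = @_;
    my $left_is_number  = $left  =~ /^\d+$/;
    my $right_is_number = $right =~ /^\d+$/;
    return $left <=> $right if $left_is_number && $right_is_number;
    return -1 if $left_is_number;    # numbers sort before letters
    return  1 if $right_is_number;
    return $left cmp $right;
}

my @names = sort { compare_chromosomes($a, $b) } qw(X 10 2 Y 1 22);
print "@names\n";    # prints "1 2 10 22 X Y"
```

In the script above, the plain sort over the keys of `%chromosomes` would then become `sort { compare_chromosomes($a, $b) } keys %chromosomes`.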

score 0 · Answer 5 · edited May 23 '17 at 12:13

There is a similar question asked and answered here:

How to do alpha numeric sort perl?

What you are likely looking for is a general numeric sort, like using sort -g.

edited May 23 '17 at 12:13 by Community
answered Jul 22 '14 at 00:41 by Vince
