sorting a multiFASTA file by DNA length

Question

I'm trying to sort a multiFASTA file by length. I have the alphabetical sort figured out but I can't seem to get the numerical sort. The output should be a sorted multiFASTA file. This is an option to another program. Here is the code.

sub sort {
my $length;
my $key;
my $id;
my %seqs;
my $seq;
my $action = shift;
my $match = $opts{$action};
$match =~ /[l|id]/ || die "not the right parameters\n";
my $in = Bio::SeqIO->new(-file=>"$filename", -format=>'fasta');
    while(my $seqobj = $in->next_seq()){
        my $id = $seqobj->display_id();
        my $length=$seqobj->length();
        #$seq =~s/.{1,60}\K/\n/sg;
        $seqs{$id} = $seqobj, unless $match eq 'l';
        $seqs{$length}=$seqobj, unless $match eq 'id';
    }
    if($match eq 'id'){
        foreach my $id (sort keys %seqs) {
             printf ">%-9s \n%-s\n", $id, $seqs{$id}->seq;
        }
    }
    elsif($match eq 'l'){
        foreach my $length ( sort keys %seqs){
             printf "%-10s\n%-s\n",$length, $seqs{$length}->seq;
        }
    }
}

score 0 · Answer 1 · answered Nov 03 '17 at 14:53

0

To sort numerically, you must provide the comparing subroutine:

sort { $a <=> $b } keys %seqs

Are you sure no two sequences can have the same length? $seqs{$length}=$seqobj overwrites the previously stored value.

answered Nov 03 '17 at 14:53

choroba

231,213
25
204
289

That's a problem that I'm having as well. It only displays 3 of the sequences. I didn't know that it would replace the value because with the alphabetical sort it's fine but now I understand it is because each id is different. – Jeffery Rosario Nov 03 '17 at 15:31
@JefferyRosario: Exactly. You can use hash of hashes and populate `$seqs{$length}{$id} = $seqobj`, and then you need two nested for loops to iterate over them, one sorted by length, the inner one by id. – choroba Nov 03 '17 at 16:04
Thank you so much. I figured it out. I also didn't need this line of code : sort { $a <=> $b } keys %seqs. – Jeffery Rosario Nov 03 '17 at 17:17
This is my updated code for the sorting by length: elsif($match eq 'l'){ foreach my $length(sort keys %seqs){ foreach my $id (keys %{ $seqs{$length}}) { printf "%-10s\n%-s\n" ,$id, $seqs{$length}{$id}->seq; – Jeffery Rosario Nov 03 '17 at 17:17
@JefferyRosario: But it sorts 20 before 3. – choroba Nov 03 '17 at 17:28

score 0 · Answer 2 · answered Nov 06 '17 at 10:40

A one lineer: use awk to linearize. a second awk to add a column containing the length, sort on this column, remove the column, restore the fasta sequence.

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fa  |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -t $'\t' -k1,1n |\
cut -f 2- |\
tr "\t" "\n"

PS: for bioinformartics questions, you should use https://www.biostars.org/, or https://bioinformatics.stackexchange.com/, etc...

score 0 · Answer 3 · answered Dec 06 '17 at 11:52

0

You can use pyfaidx or just take a look at jim hester repos. But as @pierre said above you should ask you question on biostars for example. The answer on biostars can be found here.

answered Dec 06 '17 at 11:52

Anicet Ebou

21
5

sorting a multiFASTA file by DNA length

3 Answers3