-2

Hello I've multiple sequences in stockholm format, at the top of every alignment there is a accession ID, for ex: '#=GF AC PF00406' and '//' --> this is the end of the alignment. When I'm converting the stockholm format to fasta format I need PF00406 in the header of every sequence of the particular alignment. Some times there will be multiple stockholm alignments in one file. I tried to modify the following perl script, it gave me bizarre results, any help will be greatly appreciated.

my $columns = 60;
my $gapped = 0;

my $progname = $0;
$progname =~ s/^.*?([^\/]+)$/$1/;

my $usage = "Usage: $progname [<Stockholm file(s)>]\n";
$usage .=   "             [-h] print this help message\n";
$usage .=   "             [-g] write gapped FASTA output\n";
$usage .=   "             [-s] sort sequences by name\n";
$usage .=   "      [-c <cols>] number of columns for FASTA output (default is $columns)\n";
# parse cmd-line opts
my @argv;
while (@ARGV) {
    my $arg = shift;
    if ($arg eq "-h") {
    die $usage;
    } elsif ($arg eq "-g") {
    $gapped = 1;
    } elsif ($arg eq "-s"){
    $sorted = 1;
    } elsif ($arg eq "-c") {
    defined ($columns = shift) or die $usage;
    } else {
    push @argv, $arg;
    }
}
@ARGV = @argv;

my %seq;
while (<>) {
    next unless /\S/;
    next if /^\s*\#/;
    if (/^\s*\/\//) { printseq() }
    else {
    chomp;
    my ($name, $seq) = split;
    #seq =~ s/[\.\-]//g unless $gapped;
    $seq{$name} .= $seq;
    }
}
printseq();

sub printseq {
    if($sorted){
        foreach $key (sort keys %seq){
            print ">$key\n";
            for (my $i = 0; $i < length $seq{$key}; $i += $columns){
                print substr($seq{$key}, $i, $columns), "\n";
            }
        }
    } else{
            while (my ($name, $seq) = each %seq) {
            print ">$name\n";
            for (my $i = 0; $i < length $seq; $i += $columns) {
                    print substr ($seq, $i, $columns), "\n";
            }
        }
    }
    %seq = ();
}
Bionerd
  • 21
  • 1
  • 6
  • Please provide an example of the input data and your expected output. – ProfGhost Mar 11 '15 at 07:09
  • [https://www.dropbox.com/s/0fbxpybbg78np03/stockholm.sto?dl=0] please take a look at the link – Bionerd Mar 11 '15 at 07:48
  • When we run the above program the output will be like [https://www.dropbox.com/s/7e6k8xq88d9h0uw/output.fasta?dl=0] . I need the same but in the line that starts with **>** , should contain **PF00406** with space delimited – Bionerd Mar 11 '15 at 07:52
  • I would recommend that you update your question such that it can be understood also for those not familiar with bioinformatics.. Anyway.. in the `output.fasta` file there are several lines starting with a `>`. Do you need to update all those lines? For example, the first line is : `>D0FBM0_9VIBR/1-135`. How would you modify this line? – Håkon Hægland Mar 11 '15 at 12:50
  • yep I need to update every line. The input file contains first line as >D0FBM0_9VIBR/1-135 , same line in the output should contain **>D0FBM0_9VIBR/1-135 PF00406** – Bionerd Mar 11 '15 at 13:26

1 Answers1

1

Depending on the how much variation there is in the line with the accessionID, you might need to modify the regex, but this works for your example file

my %seq;
my $aln;
while (<>) {
    if ($_ =~ /#=GF AC (\w+)/) {
       $aln = $1;
    }
    elsif ($_ =~ /^\s*\/\/\s*$/){
       $aln = '';
    }
    next unless /\S/;
    next if /^\s*\#/;
    if (/^\s*\/\//) { printseq() }
    else {
       chomp;
       my ($name, $seq) = split;
       $name = $name . ' ' . $aln;
       $seq{$name} .= $seq;
    }
}
printseq();
heathobrien
  • 1,027
  • 7
  • 11
  • working well when the file contains only one stockholm alignment if there are more than one from 2nd on wards it is not as expected. – Bionerd Mar 11 '15 at 13:23
  • Like I said, you may need to modify the regular expression in the 4th line. You've only given me one example to work with, so I have no idea what the actual pattern looks like in your data. – heathobrien Mar 11 '15 at 13:56