-2

I'm still learning Perl and I have a program which is able to take a FASTA file sequence header and print only the species name within square brackets. I want to add to this code to have it also print the entire sequence associated with the species.

Here is my code:

#!/usr/bin/perl
use warnings;

my $file = 'seqs.fasta';
my $tmp = 'newseqs.fasta';
open(OUT, '>', $tmp) or die "Can't open $tmp: $!";
open(IN, '<', $file) or die "Can't open $file: $!";

while(<IN>) {
    chomp;
    if ( $_ =~ /\[([^]]+)\]/ ) {
        print OUT "$1\n";
    }
}

close(IN);
close(OUT);

Here is a sample of the original FASTA file I had:

>gi|334187971|ref|NP_001190408.1| Cam-binding protein 60-like G [Arabidopsis thaliana] >gi|332006244|gb|AED93627.1| Cam-binding protein 60-like G [Arabidopsis thaliana]
MKIRNSPSFHGGSGYSVFRARNLTFKKVVKKVMRDQSNNQFMIQMENMIRRIVREEIQRSLQPFLSSSCVSMERSRSETP
SSRSRLKLCFINSPPSSIFTGSKIEAEDGSPLVIELVDATTNTLVSTGPFSSSRVELVPLNADFTEESWTVEGFNRNILT
QREGKRPLLTGDLTVMLKNGVGVITGDIAFSDNSSWTRSRKFRLGAKLTGDGAVEARSEAFGCRDQRGESYKKHHPPCPS
DEVWRLEKIAKDGVSATRLAERKILTVKDFRRLYTIIGAGVSKKTWNTIVSHAMDCVLDETECYIYNANTPGVTLLFNSV
YELIRVSFNGNDIQNLDQPILDQLKAEAYQNLNRITAVNDRTFVGHPQRSLQCPQDPGFVVTCSGSQHIDFQGSLDPSSS
SMALCHKASSSTVHPDVLMSFDNSSTARFHIDKKFLPTFGNSFKVSELDQVHGKSQTVVTKGCIENNEEDENAFSYHHHD
DMTSSWSPGTHQAVETMFLTVSETEEAGMFDVHFANVNLGSPRARWCKVKAAFKVRAAFKEVRRHTTARNPREGL

Currently, the output only pulls the species name Arabidopsis thaliana

However, I want it to print properly in a fasta file as such:

>Arabidopsis thaliana
MKIRNSPSFHGGSGYSVFRARNLTFKKVVKKVMRDQSNNQFMIQMENMIRRIVREEIQRSLQPFLSSSCVSMERSRSETP
SSRSRLKLCFINSPPSSIFTGSKIEAEDGSPLVIELVDATTNTLVSTGPFSSSRVELVPLNADFTEESWTVEGFNRNILT
QREGKRPLLTGDLTVMLKNGVGVITGDIAFSDNSSWTRSRKFRLGAKLTGDGAVEARSEAFGCRDQRGESYKKHHPPCPS
DEVWRLEKIAKDGVSATRLAERKILTVKDFRRLYTIIGAGVSKKTWNTIVSHAMDCVLDETECYIYNANTPGVTLLFNSV
YELIRVSFNGNDIQNLDQPILDQLKAEAYQNLNRITAVNDRTFVGHPQRSLQCPQDPGFVVTCSGSQHIDFQGSLDPSSS
SMALCHKASSSTVHPDVLMSFDNSSTARFHIDKKFLPTFGNSFKVSELDQVHGKSQTVVTKGCIENNEEDENAFSYHHHD
DMTSSWSPGTHQAVETMFLTVSETEEAGMFDVHFANVNLGSPRARWCKVKAAFKVRAAFKEVRRHTTARNPREGL

Could you suggest ways to modify the code to achieve this?

tripleee
  • 175,061
  • 34
  • 275
  • 318
Elle
  • 97
  • 2
  • 6
  • 14

2 Answers2

2

That's because what this does:

if ( $_ =~ /\[([^]]+)\]/ ) {
        print OUT "$1\n";
}

Is find and capture any text in []. But if that pattern doesn't match, you don't do anything else with the line - like print it.

Adding:

else {
    print OUT $_;
}

Will mean if a line doesn't contain [] it'll get printed by default.

I will also suggest:

  • turn on use strict;.
  • lexical filehandles are good practice: open ( my $input, '<', $file ) or die $!;
  • a pattern match implicitly applies to $_ by default. So you can write that 'if' as if ( /\[([^]]+)\]/ )
Sobrique
  • 52,974
  • 7
  • 60
  • 101
  • Thank you, that makes sense. What would be the best way to make sure the default printing doesn't include what is in the header and only includes the actual sequence after the header? – Elle Nov 09 '15 at 17:59
  • Er. Depends which bit is the "header" you refer to. But you can do e.g. `next if m/>/` to skip lines that contain a `>` – Sobrique Nov 09 '15 at 18:01
  • Oh wow, thanks for your suggestions! I can see where this would work. To format the fasta file properly to include `>` before the species name and print on the next line, I can just add \n to the `print OUT $_` statement yes? – Elle Nov 09 '15 at 18:06
  • Yes. Or just don't `chomp` the line in the first place, and it'll keep it's linefeed. – Sobrique Nov 09 '15 at 18:08
0

A couple of general points about your program

  • You must always use strict as well as use warnings 'all' at the top of every Perl program you write. It will reveal many simple mistakes that you could otherwise easily overlook

  • You have done well to choose the three-parameter form of open, but you should also use lexical file handles. So this line

    open(OUT, '>', $tmp) or die "Can't open $tmp: $!";

should be written as
    open my $out_fh, '>', $tmp or die "Can't open $tmp: $!";
  • It's probably best to supply the input and output file names on the command line, so you don't have to edit your program to run it against different files

I would solve your problem like this. It checks to see if each line is a header that contains a string enclosed in square brackets. The first test is that the line starts with a close angle bracket >, and the second test is the same as you wrote in your own program that captures the bracketed string — the species name

If these checks are passed then the species name is printed with an closing angle bracket and a newline, otherwise the line is printed as it is

This program should be run like this

$ fasta_species.pl seqs.fasta > newseqs.fasta

The dollar is just the Linux prompt character, and it assumes you have put the program in a file names fasta_species.pl. You can omit the > newseqs.fasta to display the output directly to the screen so that you can see what is being produced without creating an output file and editing it

use strict;
use warnings 'all';

while ( <> ) {
    if ( /^>/ and / \[ ( [^\[\]]+ ) \] /x ) {
        print ">$1\n";
    }
    else {
        print;
    }
}

output

>Arabidopsis thaliana
MKIRNSPSFHGGSGYSVFRARNLTFKKVVKKVMRDQSNNQFMIQMENMIRRIVREEIQRSLQPFLSSSCVSMERSRSETP
SSRSRLKLCFINSPPSSIFTGSKIEAEDGSPLVIELVDATTNTLVSTGPFSSSRVELVPLNADFTEESWTVEGFNRNILT
QREGKRPLLTGDLTVMLKNGVGVITGDIAFSDNSSWTRSRKFRLGAKLTGDGAVEARSEAFGCRDQRGESYKKHHPPCPS
DEVWRLEKIAKDGVSATRLAERKILTVKDFRRLYTIIGAGVSKKTWNTIVSHAMDCVLDETECYIYNANTPGVTLLFNSV
YELIRVSFNGNDIQNLDQPILDQLKAEAYQNLNRITAVNDRTFVGHPQRSLQCPQDPGFVVTCSGSQHIDFQGSLDPSSS
SMALCHKASSSTVHPDVLMSFDNSSTARFHIDKKFLPTFGNSFKVSELDQVHGKSQTVVTKGCIENNEEDENAFSYHHHD
DMTSSWSPGTHQAVETMFLTVSETEEAGMFDVHFANVNLGSPRARWCKVKAAFKVRAAFKEVRRHTTARNPREGL
Borodin
  • 126,100
  • 9
  • 70
  • 144