How do I extract DNA sequences from a text file without reading line by line?

Question

I'm trying to extract a DNA sequence from a text file and store it. I can do it using the following code, but it's not the best way because I'm reading the text file line by line. I'm wondering if there's an easier way to find each of the DNA sequences in my text file without reading the text file line by line.

example.pl

#!/usr/local/bin/perl
open(MYFILE, 'data.txt');
@entire_file = <MYFILE>;
while (<MYFILE>) {
    chomp;
    print "$_\n";
}

$line1 = <MYFILE>;
chomp $line1;
$line2 = <MYFILE>;
chomp $line2;
$line3 = <MYFILE>;
chomp $line3;
$line4 = <MYFILE>;
chomp $line4;
$line5 = <MYFILE>;
chomp $line5;

#Prints DNA sequence 1
print "$line2";

#Prints DNA sequence 2
print "$line5";

close(MYFILE);

data.txt

gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

This should not work because you are reading the entire file and then tying to read more data. You should work on `@entire_file` instead of `` after the loop. — perreal, Aug 14 '12 at 14:38
i' have been readin up on pattern matching, just not really sure how to do it. So many symbols. I'd like to be able to recognise a pattern like a DNA sequence GATC...etc and store it, without having to read each line in the text file. if ya can help, plz. thanks. :) — Conor C, Aug 14 '12 at 14:49
i'm doing a question and the first part says extract the content of the txt file containing both FASTA formatted file, tats why @entire file is there. then it says to extract the descriptor line, which is done by $line, and then each DNA sequence which i can do, buts its not a great way of doing it, which is why i posted the question. — Conor C, Aug 14 '12 at 14:53
I'm not sure qhat you're asking. The program you say works does nothing at all except read the file in and throw away the data. `$line2`, `$line3` etc. are all set to `undef`. You aren't printing anything. If this is homework you have surely been given some clue how to answer it? — Borodin, Aug 14 '12 at 14:57
their are 2 descriptor line in the above text file 'data.txt. haha no its not homework. — Conor C, Aug 14 '12 at 15:12
I'm writing a perl script that can: - Extract the contents of a text file, called data.txt, that contains two FASTA formatted files. | - Extract the descriptor line from each FASTA formatted files. | - Extract DNA sequence from the files and store in separate arrays. — Conor C, Aug 14 '12 at 15:19

score 3 · Answer 1 · answered Aug 14 '12 at 17:46

Here is an example using BioPerl's module, Bio::SeqIO;

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "junk.txt" ,
                           -format => 'FASTA');

while ( my $seq = $in->next_seq() ) {
    printf "id: %s\ndescr: %s\nseq: %s\n\n", $seq->id, $seq->desc, $seq->seq;
}

__END__
Contents of junk.txt

>gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCG
CTTGCGAAAGCATCGAGTACC
>gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCG
CTTGCGAAAGCATCGAGTACC

And, here is the result of running the ptogram.

C:\Old_Data\perlp>perl t5.pl
id: gi|171361,
descr: Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
seq: GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

id: gi|171362,
descr: Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
seq: GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

Using the `Bio::SeqIO` module is an excellent solution, so +1. Have updated mine to show the id. — Kenosis, Aug 14 '12 at 18:31
I would suggest using bio perl. But you could also try http://code.izzid.com/2011/10/31/How-to-read-a-fasta-file-in-perl.html — ekawas, Aug 14 '12 at 18:37

score 1 · Answer 2 · answered Aug 14 '12 at 16:20

1

After

@entire_file = <MYFILE>;

you have the entire file saved in the array @entire_file. Everything else you are doing with the readline operators (<..>) afterwards will not work, because the file has already been read in its entirety.

You can loop over the elements in the array and do whatever you want with them, e.g.,

foreach my $line (@entire_file) {
  if ($line =~ /^gi/) { print "Descriptor: $line" }
  else { print "Sequence: $line" }
}

I suggest that you read up on reading files, pattern matching and loops in general.

answered Aug 14 '12 at 16:20

mpe

1,000
1
8
25

Consider adding `next unless $line =~ /\S/;` before the conditional so blank lines are skipped, else they'll be displayed as a Sequence. Also, the FASTA lines actually begin with >, but the current formatting is not showing those characters so `$line =~ /^>gi/` would be needed. – Kenosis Aug 14 '12 at 16:52
Thanks for the help and feedback. i'll do that. :) – Conor C Aug 14 '12 at 17:52

Kenosis · Answer 3 · 2012-08-14T18:25:04.720

If you have all of your file's lines in an array, you can iterate over that array to grab the id/descriptor and sequence elements w/o using a regex:

use Modern::Perl;
use Data::Dumper;

my ( @id, @des, @dna );
chomp( my @FASTA = <DATA> );

for ( my $i = 0 ; $i < @FASTA ; $i += 3 ) {
    my ( $id, $des ) = split ', ', $FASTA[$i], 2;
    push @id,  $id;
    push @des, $des;
    push @dna, $FASTA[ $i + 1 ];
}

say Dumper \@id, \@des, \@dna;

say @FASTA + 0;

__DATA__
>gi|171361, Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs
GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

>gi|171362, Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald
GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC

Output:

$VAR1 = [
          '>gi|171361',
          '>gi|171362'
        ];
$VAR2 = [
          'Saccharomyces cerevisiae, (CYS3) gene, Lab 1, Joe Bloggs',
          'Saccharomyces cerevisiae, (CYS4) gene, Lab 2, Paul McDonald'
        ];
$VAR3 = [
          'GCAGCGATCGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC',
          'GAAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC'
        ];

score 0 · Answer 4 · answered Aug 14 '12 at 22:30

0

If you just want the sequences from the command line, this one-liner would do it:

perl -lane 'print $F[-1] if @F' data.txt

See perlrun(1) for details.

How do I extract DNA sequences from a text file without reading line by line?

4 Answers4