Bio Perl: code to split paired end data?

Question

I am a beginner with bioinformatics and I have been working on a little Bio Perl code to split my paired end MiSeq data (currently in 1 fastq file) into 2 files, each file containing one end of the pair. The different ends of the paired end reads can be distinguished by a 1 or a 2 after the space in the fastq header. The file follows a typical fastq format, example from using "head" in the command line:

@M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
TTATACTC
+
@A@AA@A@
@M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
T
+
E

I have written a code trying to target the 1 or 2 in the header using a match. Although I am using Bio::SeqIO perl does not seem to be recognizing the fastq format, and I keep getting this error:

MSG: Could not guess format from file/fh
STACK: Error::throw
STACK: Bio::Root::Root::throw /sw/lib/perl5/5.12.3/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::new /sw/lib/perl5/5.12.3/Bio/SeqIO.pm:389
STACK: SplitPairedEndReads.pl:7

Can someone help me find/fix my error? The information available from BioPerl website indicates that Bio::SeqIO should be able to recognize fastq format.

Here is the code I have written:

#!/usr/bin/perl 

use Bio::SeqIO;
use Bio::SeqIO::fastq;


$seqout1 = Bio::SeqIO->new(-file => ">peread1.fastq" -format => "fastq",);
$seqout2 = Bio::SeqIO->new(-file => ">peread2.fastq" -format => "fastq",);

$seqio_obj = Bio::SeqIO->new(-file => "AIS351_Strin1edit.fastq", -format => "fastq",
                         -alphabet => "dna" );
$seq_obj = $seqio_obj->next_seq;

while ($seq_obj = $seqio_obj->next_seq) { 
    $name = $seq_obj->desc; if($name=~ / 1:/) {$seqout1->write_seq($seq_obj);
     } else { $seqout2->write_seq($seq_obj); 

    }
}

Thanks for your help and your patience with my beginner knowledge.

~Al

toolic · Answer 1 · 2013-12-04T17:26:51.263

2

You need to add commas between all list items in your calls to new. Change:

$seqout1 = Bio::SeqIO->new(-file => ">peread1.fastq" -format => "fastq",);
$seqout2 = Bio::SeqIO->new(-file => ">peread2.fastq" -format => "fastq",);

to:

$seqout1 = Bio::SeqIO->new(-file => ">peread1.fastq", -format => "fastq",);
$seqout2 = Bio::SeqIO->new(-file => ">peread2.fastq", -format => "fastq",);

edited Dec 04 '13 at 17:26

answered Dec 04 '13 at 17:21

toolic

57,801
17
75
117

SES · Answer 2 · 2013-12-06T05:20:05.803

I would advise you to not use BioPerl for Fastq data because it is incredibly slow (see my comments below). You can use Pairfq for this task because this is one of the things it was designed for (full disclosure: I'm the author). Here's how it would work:

pairfq splitpairs -i AIS351_Strin1edit.fastq -f AIS351_Strin1edit_1.fastq -r AIS351_Strin1edit_2.fastq

In my benchmarks, this is about 300X faster than doing the equivalent task with BioPerl. For example, I measured that it takes 465 seconds to read 1 million Fastq records with Bio::SeqIO, while the above code can do it in about 1.5 seconds. If you have 500 million records, that is a difference in 64 hours vs. 11 minutes. That is why it is strongly discouraged to use BioPerl for NGS data. I'm not bashing BioPerl because I use it everyday, but be aware of this issue.

About the error in your comments, the BioPerl parser does not like what is on your '+' line. There must be nothing after the '+' or it must match the sequence header. It's hard to say specifically without seeing the real data, it could also be a line ending problem or something else.

EDIT: You need to put use strict; and use warnings; at the top of every script. Also, it is a good idea to test if a file exists before trying to do anything with it (like trying to read it with BioPerl). About your last question, I suggest you read up on the FASTQ format. You can't just remove lines from the record or else it won't be valid FASTQ. A minor point is you don't need to use Bio::SeqIO::fastq; because Bio::SeqIO will handle loading the appropriate class.

What you posted does not look like real data so it's not easy to say what is causing the problem.

Thank you for this. As I added in my update, part of why I was writing this was to improve my programming skills, and because I had a problem that needed solving. I will likely continue trying to make my own code work, or at least run without error messages. But I will also likely use your Pairfq, as I do need reliable split reads for some genome gap closing software. — user2852013, Dec 06 '13 at 03:15
@user2852013 I updated my post to address your questions. In summary, you need to improve your script by following best practices and show a snippet of real data that is causing the problem. — SES, Dec 06 '13 at 05:24

score 0 · Answer 3 · answered Dec 04 '13 at 17:17

You could achieve what you're after with this snippet:

#!/usr/bin/perl
use warnings;
use strict; 

my @array = ('@M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
TTATACTC
+
@A@AA@A@',
'@M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
T
+
E');

foreach (@array){
        if (/\s+1:/) {
            print "1st pair: $_\n"; # You could redirect this to first.OUTFILE
         }
        if (/\s+2:/) {
            print "2nd pair: $_\n"; # You could redirect this to second.OUTFILE
         }

}

Which prints:

1st pair: @M00763:6:000000000-A1U80:1:1101:12620:1732 1:N:0:1
TTATACTC
+
@A@AA@A@
2nd pair: @M00763:6:000000000-A1U80:1:1101:12620:1732 2:N:0:1
T
+

Unfortunately, this is not a good approach. This type of data typically consists of millions or even billions of lines. Reading that into an array and looping over it is a very bad idea. You are using what was provided by the OP so, in theory it works, but in practice it's just not a good approach. — SES, Dec 04 '13 at 23:31

Bio Perl: code to split paired end data?

3 Answers3