Processing FASTQ files based on mate pair length

Question

The following files are the two mates of a paired-end FASTQ data set. I want to split each FASTQ file into separate files based on read length.

mate1.fq:

@SRR127.1
TGGTTATGATGTTTGTGTAGGAATAGAAATTTTGATTAAGATATTAGTGAAATTTGAATGTAGTTTATTTGGAAGTTATGGAGAGTTTATATTGTATTTATGTTTATTGTTGTAGATTTATATTTATGTGTATATATTAGTTTTTTTGTGT
+
ABAAAF4FFFFFGGGGGGFFGGFGHGFGHHHHHGGCFFGHHHHH5FDBED55DGGFEGFHHHGBHDDHHHFF3AB3FFG5CBGBEF5BD5DGFEGHFAGAFEDGHGFHHGHGEFFGFGGHFEGHHFHGBEBGHHHHGHBHHFHHGGFGHH2
@SRR127.2
TATGGTAAGAAAATTGAAAATTATAAAAAATGAAAAATGTTTATTTGATGATTTGAAAAATGATGAAATTATTGAAAAATGTGAAAAATGAGAAATGTATATTGTAGGATTTGGAATATGGTGAGATAAATGAAAATTATAGTAAATG
+
AABAA5@D4@5CFFCA55FFGGHDGFHFFCC45DGFA2FA5DD55AAAA55DDBDEDDBGGFF5BA5DDABF5D5B5FF1ADFB5EDGHFG5@BFBD55D5FFB@@5@GBGEFBGHHGB@DBBFHFBDG3B43FFH@FGFHH?FHHHH

mate2.fq:

@SRR127.1
ACCTATAAAAAAACCATATCAATAACTATAAAATCTTTATAAAATCCCACCCAATTAAAAAAAAATAAATTAATACATATAAAACCTTAAACACATAAAACATAATCACATACTATATAAACAATTACTATCACTACTAAACACCTAATA
+
>AA?AF13B@D@1EFCGGGFFG3EBGHHHBB2FGHHGHGFDGHHDFEGFHGGGHG1FFF1GGCGGGBGHHHHHFHHHHFHEGGFHF0BD1FGHHAGEGHFHHHFGGFHHGHHHFHHGGFHBGHFED1FBGFGFHDGHGHFGG1GB0GFHH
@SRR127.2
CTATTTCTCATTTTTTTATAATTTTCAATTCTCTTACCATATTCCACATCCTACACTAAACATTTCTAAATTTTCCACCTTTTTCTATTTTTCTCACCATATTTCATATCCTAAAAAACATATTCCTCATTTACTATAATTTTCAATTATC
+
11>>AFFDFF3@FFF?EFFGFBGHFDFA33D2FF2GGHFE12DD221AF1F1E1BG1GGBFBGGEGHDAABGAGDFABGG1BBDF12A2@2BG@2@DEFFF2B2@2222BB2211FGEE/11@22B2>1B22F2>GBGBD22BGD2>2B22

I wrote the following code to do this, but I get a strange error only for the second file (mate2.fq), even though both files contain 151 bp reads.

#!/usr/bin/perl

use strict;
use warnings;

my @fh;

my $file_name = $ARGV[0];
my $infile    = $ARGV[1];

#convert every 4-line fastq to 1-line
open(FH, "cat '$infile' | awk '{printf \"%s%s\",\$0,(NR%4?FS:RS)}' | ");

while (<FH>) {
  chomp;

  my @line = split(/\s+/, $_);
  my $len  = length($line[1]);

  if ($len >= 100) {

    #print $len,"\n",$_,"\n";
    push @fh, $len;

    if (not defined $fh[$len]) {
      open $fh[$len], '>', "$file_name\_$len";
    }
    print { $fh[$len] } (join("\n", @line), "\n");
  }

}

Error:

Can't use string ("151") as a symbol ref while "strict refs" in use at

How can I process these files?

`push @fh, $len;` doesn't make sense as you're storing plain scalar into an array reserved for filehandles. — mpapec, May 01 '15 at 09:03
@TahmtanEbrahimi I changed the title of your post to make it more meaningful and general for future searches, and preserved the error message in the body of the text. Generally, titles should not be overly long or repeat specific error messages or code - instead, these should be in the body of your post. Please further edit the title so that it makes sense to bioinformaticians if I have made a mistake in the wording. — G. Cito, May 01 '15 at 13:41
While you are not directly using [Bioperl](http://www.bioperl.org/) packages here, I added this tag for the sake of relevance. Some of the existing [BioPerl scripts](https://metacpan.org/release/BioPerl) included in the BioPerl distribution might be helpful for this or other aspects of your work. — G. Cito, May 01 '15 at 13:45

Borodin · Accepted Answer · 2015-05-01T14:45:11.023

As you have read, your problem is because of a spurious push that adds an integer value to the end of the @fh array. I presume you were aiming to extend the array to be long enough to add the new file handle. You can do that by assigning to $#fh, so you would write $#fh = $len if $#fh < $len; however it is unnecessary because Perl will extend arrays automatically for you when you simply assign to an element off the end of the array
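To make the point about automatic extension concrete, here is a minimal sketch. The sub names and the `@slots` array are hypothetical and exist only for this illustration; both subs should report 152 elements for index 151.

```perl
use strict;
use warnings;

# Hypothetical helper: size of an array after assigning to a single index.
sub size_after_assign {
    my ($index) = @_;
    my @slots;                     # starts empty
    $slots[$index] = 'handle';     # Perl grows the array to $index + 1 elements
    return scalar @slots;
}

# Hypothetical helper: size of an array after pre-extending it via $#array.
sub size_after_preextend {
    my ($index) = @_;
    my @slots;
    $#slots = $index if $#slots < $index;   # explicit, but unnecessary
    return scalar @slots;
}

printf "after assignment:    %d elements\n", size_after_assign(151);
printf "after pre-extension: %d elements\n", size_after_preextend(151);
```

Either way, no `push` is needed before storing a file handle at `$fh[$len]`.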

I have a couple of comments on your program that I hope you find useful

It is unnecessary and wasteful to shell out to an awk command. Perl is quite capable of doing all that awk can do
If you find yourself writing split /\s+/, $_ then you almost certainly mean just split: the default behaviour is to do split ' ', $_. If you use /\s+/ as the pattern and there happens to be leading whitespace on the string you are splitting, then split will return an empty string as the first item in the list of fields. If you use ' ' instead (a literal single space, not the pattern / /) then this won't happen. In effect, split ' ' is equivalent to /\S+/g
When interpolating variable values within a string it's generally neater to put identifiers inside braces if there is a following character that could be part of the identifier. So "${file_name}_$len" instead of "$file_name\_$len"
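The difference between the two split forms, and the braced interpolation, can be seen in a short sketch. The sample record and variable names below are made up for illustration.

```perl
use strict;
use warnings;

# A made-up one-line record with leading whitespace (single quotes, so the
# @ sign is not interpolated).
my $record = '  @SRR127.1 ACGT + FFFF';

my @by_pattern = split /\s+/, $record;   # first field is an empty string
my @by_space   = split ' ',   $record;   # leading whitespace is skipped
my @by_match   = $record =~ /\S+/g;      # same fields as split ' '

printf "split /\\s+/ : %d fields\n", scalar @by_pattern;
printf "split ' '   : %d fields\n", scalar @by_space;
printf "/\\S+/g      : %d fields\n", scalar @by_match;

# Braces mark where the variable name ends, so no backslash is needed.
my ( $file_name, $len ) = ( 'out', 151 );
my $braced = "${file_name}_$len";
print "$braced\n";
```

With the pattern form the sequence would end up in `$by_pattern[2]` rather than index 1, which is exactly the kind of off-by-one the special `' '` form avoids.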

This is how I would write your code. It accumulates the input records into $line until four records have been added, and then processes that line as before.

#!/usr/bin/perl

use strict;
use warnings;

my ($file_name, $infile) = @ARGV;

open my $in_fh, '<', $infile or die $!;
my $line;

my @fh;
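while ( <$in_fh> ) {
  # Keep each line's trailing newline: split ' ' below needs it as the separator
  $line .= $_;

  if ( $. % 4 == 0 or eof ) {

    my @line = split ' ', $line;
    $line = undef;    # reset before `next` so a skipped record is not carried over

    my $len = length( $line[1] // '' );
    next if $len < 100;

    unless ( $fh[$len] ) {
      open $fh[$len], '>', "${file_name}_$len" or die $!;
    }
    print { $fh[$len] } "$_\n" for @line;
  }
}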

Tip: I doubt that most people know that `eof` and `eof()` are different, much less what each does. Best to use `eof($in_fh)`. — ikegami, May 01 '15 at 14:00
@ikegami: I think it is enough to let your comment tell its own tale. To me this is a design mistake, as even `eof()` and `eof ARGV` are different — Borodin, May 01 '15 at 14:39

score 5 · Answer 2 · answered May 01 '15 at 10:08

What this error specifically means is that you're doing something that expects a reference, but it's not getting one.
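A minimal reproduction may help here. This is only a sketch: `try_print` is a hypothetical helper, and an in-memory file handle stands in for a real output file.

```perl
use strict;
use warnings;

# Hypothetical helper: try to print through whatever is stored in $slot.
# Returns the error message, or an empty string if the print succeeded.
sub try_print {
    my ($slot) = @_;
    my $ok = eval { print {$slot} ''; 1 };
    return $ok ? '' : $@;
}

# A plain number where a file handle is expected, as left behind by push.
my $err_number = try_print(151);

# A real (in-memory) file handle for comparison.
open my $mem, '>', \my $buffer or die "Cannot open in-memory file: $!";
my $err_handle = try_print($mem);
close $mem;

print "number: $err_number";
print "handle: ", ( $err_handle eq '' ? "ok\n" : $err_handle );
```

Under `use strict`, Perl refuses to treat the string "151" as the name of a file handle, which is where the "symbol ref" wording comes from.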

The line:

print {$fh[$len]} (join("\n",@line),"\n");

Is explicitly printing to a filehandle - from what looks like a list of filehandles called @fh.

This line:

push @fh, $len;

Will be inserting a numeric value into that list. (Presumably $line[1] is 151 characters long). And so you're actually trying to:

 print {151} (join("\n",@line),"\n");

Which, as is hopefully pretty obvious, just isn't going to work. It looks like you're trying to open a filehandle and insert it into an array:

open $fh[$len], '>', "$file_name\_$len";

Can I suggest instead that you'd be much better off using a hash for this? Otherwise you've got an array full of empty elements, with one populated.

Where you could instead:

#further up:
my %fh; 


#and then
open ( $fh{$len}, ">", "$file_name\_$len" ) or warn $!;

Don't forget to close your filehandles at the end though:

foreach my $key ( keys %fh ) {
   close ( $fh{$key} );
}
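Putting the hash, the open and the close together, a self-contained sketch could look like the following. `write_by_length` and the `demo` prefix are hypothetical names, and the demo writes into a temporary directory rather than using real reads.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write each sequence to "<prefix>_<length>", opening
# every output file only once and closing them all at the end.
sub write_by_length {
    my ( $prefix, @seqs ) = @_;
    my %fh;
    for my $seq (@seqs) {
        my $len = length $seq;
        unless ( $fh{$len} ) {
            open $fh{$len}, '>', "${prefix}_$len"
                or die "Cannot open ${prefix}_$len: $!";
        }
        print { $fh{$len} } "$seq\n";
    }
    for my $len ( keys %fh ) {
        close $fh{$len} or warn "Cannot close ${prefix}_$len: $!";
    }
    return sort { $a <=> $b } keys %fh;
}

my $dir     = tempdir( CLEANUP => 1 );
my @lengths = write_by_length( "$dir/demo", 'ACGT', 'ACGTA', 'TTTT' );
print "lengths written: @lengths\n";
```

Only the lengths that actually occur get a key, so there are no empty slots as there would be in a sparse array.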

I would also suggest rather than:

open( FH, "cat '$infile' | awk '{printf \"%s%s\",\$0,(NR%4?FS:RS)}' | " );

You'd probably be better off handling that within perl, as all you're doing is parsing a file using an external binary. (And use lexical filehandles: `open ( my $input, "-|", "cat '$infile' | awk '{printf \"%s%s\",\$0,(NR%4?FS:RS)}'" ) or warn $!;`)
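As a rough sketch of what handling it within perl might look like, the following reads four lines at a time instead of piping through cat and awk. `next_record` is a hypothetical helper, and the two tiny reads are invented so the example runs without an input file.

```perl
use strict;
use warnings;

# Hypothetical helper: read one four-line FASTQ record from $fh.
# Returns the chomped lines, or an empty list at end of file or when the
# final record is truncated.
sub next_record {
    my ($fh) = @_;
    my @rec;
    for ( 1 .. 4 ) {
        my $l = <$fh>;
        return unless defined $l;
        chomp $l;
        push @rec, $l;
    }
    return @rec;
}

# Invented sample data, read through an in-memory file handle.
my $fastq = "\@r1\nACGT\n+\nFFFF\n\@r2\nAC\n+\nFF\n";
open my $in, '<', \$fastq or die "Cannot open in-memory file: $!";

my @lengths;
while ( my @rec = next_record($in) ) {
    push @lengths, length $rec[1];    # $rec[1] is the sequence line
}
close $in;

print "sequence lengths: @lengths\n";
```

Because each record stays as four separate lines, nothing is split on whitespace, so header lines that contain spaces are not a problem.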
