
I've tested my program on a dozen Windows machines, half a dozen Macs, and a Linux machine. It runs without error on Windows and Linux but not on the Macs. My program is designed to work with protein database files, which are text files ranging from 250MB to 10GB. I took 1/10th of the 250MB file to make a sample file for debugging, but the error did not occur with the smaller file.

I've narrowed the bug down to this section of code, where $tempFile is the protein database file:

open(ps_file, "..".$slash."dataset".$slash.$tempFile) 
         or die "couldn't open $tempFile";
while(<ps_file>){
    chomp;


    my @curLine = split(/\t/, $_);
    my $filter = 1;
    if($taxon){
        chomp($curLine[2]);

        print "line2 ".$curLine[2].",\t".$taxR{$curLine[2]}."\n";

        $filter = $taxR{$curLine[2]};
    }
    if($filter){
        checkSeq(@curLine);
    }
}

This is a screenshot of the output of that print statement showing special characters:

[Screenshot: output of that print statement showing special characters]

This is what the output looks like on a Windows machine:

[Screenshot: output on a Windows machine]

Here is an example of one line from $tempFile:

>sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1 MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK CYAPA

Sinan Ünür
Dave D

1 Answer

The problem probably lies in inconsistent line-endings. If, as I suspect, trailing whitespace is not significant, you're better off removing that instead of chomping.
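To illustrate the suspicion, here is a minimal sketch, assuming the file has Windows-style CRLF line endings (I can't confirm that from the question). The record, the sequence, and the contents of %taxR are made up for the demonstration. With the default $/, chomp removes only the trailing "\n", so a "\r" stays attached to the last field and the hash lookup fails, whereas stripping all trailing whitespace removes both characters.

```perl
use strict;
use warnings;

# Hypothetical lookup table standing in for %taxR from the question.
my %taxR = ( CYAPA => 1 );

# A made-up record with a Windows-style CRLF line ending.
my $raw = ">sp|P48255|ABCX_CYAPA\tMSTEKTKILEV\tCYAPA\r\n";

# chomp removes only the trailing "\n" (the default $/), so "\r" survives.
my $chomped = $raw;
chomp $chomped;
my $chomped_field = ( split /\t/, $chomped )[2];

# Stripping all trailing whitespace removes "\r" and "\n" alike.
my $stripped = $raw;
$stripped =~ s/\s+\z//;
my $stripped_field = ( split /\t/, $stripped )[2];

printf "chomped:  length %d, found in hash: %s\n",
    length($chomped_field),
    ( exists $taxR{$chomped_field} ? 'yes' : 'no' );
printf "stripped: length %d, found in hash: %s\n",
    length($stripped_field),
    ( exists $taxR{$stripped_field} ? 'yes' : 'no' );
```

A stray "\r" printed to a terminal would also be consistent with the odd characters in the first screenshot, though that is a guess on my part.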

Also note:

  • Bareword filehandles such as ps_file are package global variables that are subject to action at a distance; use lexical filehandles instead.

  • Use File::Spec or Path::Class to handle file paths in a platform-independent way.

  • Include the full file path and the error message ($!) when opening a file fails.

  • In

    chomp;
    
    my @curLine = split(/\t/, $_);
    my $filter = 1;
    if($taxon){
        chomp($curLine[2]);
    

$curLine[2] comes from a string that was read in as a line and chomped. I don't see why you are chomping that again.
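A quick way to check whether the second chomp does anything is to look at its return value, which is the number of characters removed. This is only a sketch with a made-up record ending in a plain "\n"; the variable names are mine, not from your program.

```perl
use strict;
use warnings;

# Made-up tab-separated record ending in a plain "\n".
my $line = "id\tsequence\tCYAPA\n";

my $first  = chomp $line;         # removes the trailing "\n"
my @fields = split /\t/, $line;
my $second = chomp $fields[2];    # nothing left for chomp to remove

print "first chomp removed $first character(s)\n";
print "second chomp removed $second character(s)\n";
```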

Here's a tidied-up version of your code snippet:

use File::Spec::Functions qw( catfile );

my $input_file = catfile('..', dataset => $tempFile);


open my $ps_file, '<', $input_file
    or die "couldn't open '$input_file': $!";

while (my $line = <$ps_file>) {
    $line =~ s/\s+\z//; # remove all trailing space

    my @curLine = split /\t/, $line;

    my $filter = 1;
    if ($taxon) {
        my $field = $curLine[2];
        $filter = $taxR{ $field };

        print join("\t", "line2 $field", $filter), "\n";
    }
    if ($filter) {
        checkSeq(@curLine);
    }
}
Sinan Ünür
  • That fixed it with the 250MB file. I'm going to test it with the 10GB file now, but that will take 30 minutes to run. Thank you so much. – Dave D Jun 28 '12 at 15:23