0

My goal is to open a file containing a single column of fixed length (1 character = 2 bytes on my Mac), and then to read the lines of the file into an array, beginning and ending at specified points. The file is very long, so I am using the seek command to jump to the appropriate starting line of the file. The file is a chromosomal sequence, arranged as a single column. I am successfully jumping to the appropriate point in the file, but I am having trouble reading the sequence into the array.

my @seq = (); # to contain the stretch of sequence I am seeking to retrieve from file.
my $from_bytes = 2*$from - 2; # specifies the "start point" in terms of bytes.
seek( SEQUENCE, $from_bytes, 0 );
my $from_base = <SEQUENCE>;
push ( @seq, $from_base ); # script is going to the correct line and retrieving correct base.

my $count = $from + 1; # here I am trying to continue the read into @seq
while ( <SEQUENCE> ) {
        if ( $count = $to ) { # $to specifies the line at which to stop
              last;
        }

        else {
             push( @seq, $_ );
             $count++;
             next;  
        }
}
print "seq is: @seq\n\n"; # script prints only the first base
ldav1s
  • 15,885
  • 2
  • 53
  • 56
ES55
  • 490
  • 1
  • 5
  • 14

1 Answers1

1

It seems you are reading fixed width records, consisting of $to lines, and each line has 2 bytes (1 char + 1 newline). As such you can simply read each chromosome sequence with a single read. A short example:

use strict;
use warnings;
use autodie;

my $record_number    = $ARGV[0];
my $lines_per_record = 4; # change to the correct value
my $record_length    = $lines_per_record * 2;
my $offset           = $record_length * $record_number;

my $fasta_test = "fasta_test.txt";

if (open my $SEQUENCE, '<', $fasta_test) {
    my $sequence_string;
    seek $SEQUENCE, $offset, 0;
    my $chars_read = read($SEQUENCE, $sequence_string, $record_length);
    if ($chars_read) {
        my @seq = split /\n/, $sequence_string; # if you want it as an array
        $sequence_string =~ s/\n//g; # if you want the chromosome sequence as a single string without newlines
        print $sequence_string, "\n";
    } else {
        print STDERR "Failed to read record $record_number!\n";
    }

    close $SEQUENCE;
}

With more information one could probably present a better solution.