-1

I wrote a PERL program which takes an excel sheet (coverted to a text file by changing the extension from .xls to .txt) and a sequence file for its input. The excel sheet contains the start point and the end point of an area in the sequence file (along with 70 flanking values on either side of the match area) that needs to cut and extracted into a third output file. There are like 300 values. The program reads in the start point and the end point of the sequence that needs to be cut each time but it repeatedly tells me that the value is outside the length on the input file when it clearly isn't. I just cant seem to get this fixed

This is the program

use strict;
use warnings;

my $blast;
my $i;
my $idline;
my $sequence;
print "Enter Your BLAST result file name:\t";
chomp( $blast = <STDIN> );    # BLAST result file name
print "\n";

my $database;
print "Enter Your Gene list file name:\t";
chomp( $database = <STDIN> );    # sequence file
print "\n";

open IN, "$blast" or die "Can not open file $blast: $!";

my @ids       = ();
my @seq_start = ();
my @seq_end   = ();

while (<IN>) {

    #spliting the result file based on each tab
    my @feilds = split( "\t", $_ );
    push( @ids, $feilds[0] );    #copying the name of sequence
         #coping the 6th tab value of the result which is the start point of from where a value should be cut.
    push( @seq_start, $feilds[6] );
    #coping the 7th tab value of the result file which is the end point of a value should be cut.
    push( @seq_end, $feilds[7] );
}
close IN;

open OUT, ">Result.fasta" or die "Can not open file $database: $!";

for ( $i = 0; $i <= $#ids; $i++ ) {

    ($sequence) = &block( $ids[$i] );

    ( $idline, $sequence ) = split( "\n", $sequence );

    #extracting the sequence from the start point to the end point
    my $seqlen = $seq_end[$i] - $seq_start[$i] - 1;

    my $Nucleotides = substr( $sequence, $seq_start[$i], $seqlen );  #storing the extracted substring into $sequence

    $Nucleotides =~ s/(.{1,60})/$1\n/gs;

    print OUT "$idline\n";
    print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";

sub block {
    #block for id storage which is the first tab in the Blast output file.
    my $id1 = shift;
    print "$id1\n";
    my $start = ();

    open IN3, "$database" or die "Can not open file $database: $!";

    my $blockseq = "";
    while (<IN3>) {

        if ( ( $_ =~ /^>/ ) && ($start) ) {

            last;
        }

        if ( ( $_ !~ /^>/ ) && ($start) ) {

            chomp;
            $blockseq .= $_;
        }

        if (/^>$id1/) {

            my $start = $. - 1;
            my $blockseq .= $_;
        }
    }
    close IN3;

    return ($blockseq);
}

BLAST RESULT FILE: http://www.fileswap.com/dl/Ws7ehftejp/

SEQUENCE FILE: http://www.fileswap.com/dl/lPwuGh2oKM/

Error

substr outside of string at Nucleotide_Extractor.pl line 39.

Use of uninitialized value $Nucleotides in substitution (s///) at Nucleotide_Extractor.pl line 41.

Use of uninitialized value $Nucleotides in concatenation (.) or string at Nucleotide_Extractor.pl line 44.

Any help is very much appreciated and queries are always invited

Miller
  • 34,962
  • 4
  • 39
  • 60
The Last Word
  • 203
  • 1
  • 7
  • 24
  • What is in phytophthora file?. I couldn't process the block function without it. You appear to take substring with starting index outside the string length like substr("Hello",45,4). Since it returns nothing $Nucleotides is also uninitialised. I would suggest you check the index of substr. – xtreak Sep 19 '14 at 05:42
  • @Wordzilla That is the sequence file name to which the link I have provided in the question. I have uploaded both the input files in fileswap and provided the links. Please download both the files and process it. The sequence belongs to an organism named Phytophthora. I changed the file name now. Thank you – The Last Word Sep 19 '14 at 05:56
  • You should also `use strict` on your script and declare all your variables using `my` -- i.e. `my $sequence = ...`. – i alarmed alien Sep 19 '14 at 07:04
  • @ialarmedalien Did, still doesn't work. I am changing the code here. It is some error to do with the substr which I just cant get my finger on. The input files and the code is available. If possible, please run it on your system. – The Last Word Sep 19 '14 at 07:18
  • Can you add some comments to show what you're trying to do at each step. It's very hard to work out the program logic... – i alarmed alien Sep 19 '14 at 07:20
  • @ialarmedalien Gimme 10 minutes – The Last Word Sep 19 '14 at 07:21
  • @ialarmedalien better? Please download the input files and run the code. Might get a better idea – The Last Word Sep 19 '14 at 07:32
  • What are you trying to do in the `block` subroutine? Do the fasta files you're reading in usually have more than one sequence per file? Have you checked whether $sequence is set after running `block`? (hint: it isn't) – i alarmed alien Sep 19 '14 at 07:50

1 Answers1

2

There were several problems with the existing code, and I ended up rewriting the script while fixing the errors. Your implementation isn't very efficient as it opens, reads, and closes the sequence file for every ID in your Excel sheet. A better approach would be either to read and store the data from the sequence file, or, if memory is limited, go through each entry in the sequence file and pick out the corresponding data from the Excel file. You would also be better off using hashes, instead of arrays; hashes store data in key -- value pairs, so it is MUCH easier to find what you're looking for. I have also used references throughout, as they make it easy to pass data into and out of subroutines.

If you are not familiar with perl data structures, check out perlfaq4 and perldsc, and perlreftut has information on using references.

The main problem with your existing code was that the subroutine to get the sequence from the fasta file was not returning anything. It is a good idea to put plenty of debugging statements in your code to ensure that it is doing what you think it is doing. I've left in my debugging statements but commented them out. I've also copiously commented the code that I changed.

#!/usr/bin/perl
use strict;
use warnings;
# enables 'say', which prints out your text and adds a carriage return
use feature ':5.10';
# a very useful module for dumping out data structures
use Data::Dumper;

#my $blast = 'infesmall.txt';
print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>);     # BLAST result file name
print "\n";

#my $database = 'infe.fasta';
print "Enter Your Gene list file name:\t";
chomp($database = <STDIN>);  # sequence file
print "\n";

open IN,"$blast" or die "Can not open file $blast: $!";

# instead of using three arrays, let's use a hash reference!
# for each ID, we want to store the start and the end point. To do that,
# we'll use a hash of hashes. The start and end information will be in one
# hash reference:
# { start => $fields[6], end => $fields[7] }
# and we will use that hashref as the value in another hash, where the key is
# the ID, $fields[0]. This means we can access the start or end data using
# code like this:
#   $info->{$id}{start}
#   $info->{$id}{end}
my $info;

while(<IN>){
    #splitting the result file based on each tab
    my @fields = split("\t",$_);
    # add the data to our $info hashref with the ID as the key:
    $info->{ $fields[0] } = { start => $fields[6], end => $fields[7] };
}
close IN;

#say "info: " . Dumper($info);

# now read the sequence info from the fasta file
my $sequence = read_sequences($database);
#say "data from read_sequences:\n" . Dumper($sequence);

my $out = 'result.fasta';
open(OUT, ">" . $out) or die "Can not open file $out: $!";

foreach my $id (keys %$info) {

    # check whether the sequence exists
    if ($sequence->{$id}) {
        #extracting the sequence from the start point to the end point
        my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1;

        #say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end};

        #storing the extracted substring into $sequence
        my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen);
        $nucleotides =~ s/(.{1,60})/$1\n/gs;
        #say "nucleotides: $nucleotides";
        print OUT $sequence->{$id}{header} . "\n";
        print OUT "$nucleotides\n";
    }
}
print "\nExtraction Completed...";

sub read_sequences {
    # fasta file
    my $fasta_file = shift;

    open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!";

    # initialise two variables. We will store our sequence data in $fasta
    # and use $id to track the current sequence ID
    # the $fasta hash will look like this:
    # $fasta = {
    #   'gi|7212472|ref|NC_002387.2' => {
    #       header => '>gi|7212472|ref|NC_002387.2| Phytophthora...',
    #       seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...',
    #   }
    #}

    my ($fasta, $id);

    while(<IN3>){
        chomp;
        if (/^>/) {
            if (/^>(\S+) /){
                # the header line with the sequence info.
                $id = $1;
                # save the data to the $fasta hash, keyed by seq ID
                # we're going to build up an entry as we go along
                # set the header to the current line
                $fasta->{ $id }{ header } = $_;
            }
            else {
                # no ID found! Erk. Emit an error and undef $id.
                warn "Formatting error: $_";
                undef $id;
            }
        }
        ## ensure we're getting sequence lines...
        elsif (/^[ATGC]/) {
            # if $id is not defined, there's something weird going on, so
            # don't save the sequence. In a correctly-formatted file, this
            # should not be an issue.
            if ($id) {
                # if $id is set, add the line to the sequence.
                $fasta->{ $id }{ seq } .= $_;
            }
        }
    }
    close IN3;
    return $fasta;
}
i alarmed alien
  • 9,412
  • 3
  • 27
  • 40