0

I have been using the module use Bio::DB::Fasta to access fasta files (documentation here:https://metacpan.org/pod/Bio::DB::Fasta#OBJECT-METHODS). I find that this is much fasta than using Samtools for extracting positions from a fasta file. However, I was wondering if anyone knows what happens if a query includes a position beyond the max length of the fasta.

Today, in a query, I tried accessing a position in the fasta which is beyond the maximum position in the fasta. However, the method did not give an error in this case. My fasta file contains 0/1 bases and the output returned was "1". I was wondering if this is an error or in fact it is giving a valid output but for the wrong position. I tried looking through the documentation but could not find any information about the error codes.

My code is as follows:

use strict;
use warnings;
use Bio::DB::Fasta;

my $maskFile = "1KG_maskfile.fa";

my $db = Bio::DB::Fasta->new($maskFile);

my $chrom = "chr1";
my $start = 300240548;
my $end = 300240548;
my $query = "$chrom:$start-$end"; 
my $seq = $db->seq($query, $start, $end); # also tried $seq = $db->seq($query); 
print $seq, "\n";

Note: In the 1KG_maskfile.fa, the max position is 249224750 (based on character count, excluding header).

user19758
  • 131
  • 1
  • 5
  • I am curious why are you trying access a position in the fasta which is beyond the maximum position. Human chr1 is 248,956,422 nt long. What kind of useful error/warning you are expecting. When chr1 size updated in the reference n+1 or n-1 for example, would someone need to update the warning? – Supertech Jan 02 '23 at 18:25

1 Answers1

0

There are two issues I see here.The first one is that you are not formatting the query ID correctly, unless you have the start/end positions in the Fasta header (which would be odd). To get the sequence you want by region, just specify the specific ID and the coordinates, i.e.,

my $seq = $db->seq('chr1', 25000, 27000);

The other issue you mentioned looks like a bug. I don't think there are any explicit checks if the start/stop positions are beyond the actual sequence length. I just tested it and the method failed silently. There are a lot of other format checks in that code, this may be a good thing to report as a bug.

SES
  • 850
  • 1
  • 9
  • 21
  • Actually, the query is supposed to be very flexible and doing my $seq = $db->seq($query, $start, $end) should give the same answer as your query. I have empirically tested the method using 1KG data and comparing it with samtools. However, the program gives the wrong answer if you wrong outside the boundaries of the fasta. I emailed the author and hopefully the fix should be relatively easy to implement. In the meantime, I found FaSlice which performs the same function. http://search.cpan.org/~ajpage/Bio-Pipeline-Comparison-1.123050/lib/FaSlice.pm – user19758 Feb 07 '14 at 03:22
  • 1
    @user19758, what do you mean "gives the wrong answer?" I don't know what to do with that information. Please show an example. Also, if you know you are specifying a query outside of the range of the sequence, what is the point of this exercise? Lastly, you say the query "should be flexible." I'm not sure what you mean my that. The query is not flexible, it is the sequence ID you want to query. – SES Feb 07 '14 at 16:38