Extracting DNA sequences from FASTA file with BioPerl with non-standard header

Question

I'm trying to extract sequences from a database using the following code:

use strict;
use Bio::SearchIO; 
use Bio::DB::Fasta;


my ($file, $id, $start, $end) = ("secondround_merged_expanded.fasta","C7136661:0-107",1,10);
my $db = Bio::DB::Fasta->new($file);
my $seq = $db->seq($id, $start, $end);
print $seq,"\n";

Where the header of the sequence I'm trying to extract is: C7136661:0-107, as in the file:

>C7047455:0-100
TATAATGCGAATATCGACATTCATTTGAACTGTTAAATCGGTAACATAAGCAGCACACCTGGGCAGATAGTAAAGGCATATGATAATAAGCTGGGGGCTA

The code works fine when I switch the header to something more standard (like test). I'm thinking that BioPerl doesn't like the non-standard heading. Any way to fix this so I don't have to recode the FASTA file?

Besides from your file not looking like a standard fasta file (it's missing the `>` at the beginning of the header line), what exactly does not work with your code? — mpe, Dec 04 '12 at 17:12
sorry, for whatever reason the stack overflow markup removed the > at the beginning of the header line. what happpens is that the code does not return the sequence. If i switch the header to something like >test, and use that as the $id field, it works fine! very confusing! — jasongallant, Dec 04 '12 at 17:34
I fixed the markup; can you confirm that the TATAAT... is actually on a separate line? (I'm bio-ignorant) — ysth, Dec 04 '12 at 17:43
Isn't clear from your question if you want to extract the fasta non-standard header of each sequence, or extract the actual sequence. May you clarify? — Hernán, Dec 04 '12 at 17:52
Thanks @ ysth for fixing! @ Hernan I want to extract the sequence based on the non-standard header. — jasongallant, Dec 04 '12 at 18:02

score 3 · Accepted Answer · answered Dec 04 '12 at 20:57

3

By default, Bio::DB::Fasta will use all non-space characters immediately following the > on the header line to form the key for the sequence. In your case this looks like C7047455:0-100, which is the same as the built-in abbreviation for a subsequence. As documented here, instead of $db->seq($id, $start, $stop) you can use $db->seq("$id:$start-$stop"), so a call to $db->seq('C7136661:0-107') looks like you are asking for $db->seq('C7136661', 0, 107), and that key doesn't exist.

I have no way of knowing what is in your data, but if it is adequate to use just the first part of the header up to the colon as a key then you can use the -makeid callback to modify the key. Then if you use just C7136661 to retrieve the sequence it will work.

This code demonstrates. Note that you will probably already have a .index cache file that you must delete before you see any change in behaviour.

use strict;
use warnings;

use Bio::DB::Fasta;

my ($file, $id, $start, $end) = qw(
  secondround_merged_expanded.fasta
  C7136661
  1 10
);

my $db = Bio::DB::Fasta->new($file, -makeid => \&makeid);

sub makeid {
  my ($head) = @_;
  $head =~ /^>([^:]+)/ or die qq(Invalid header "$head");
  $1;
}

my $seq = $db->seq($id, $start, $end);
print $seq, "\n";

answered Dec 04 '12 at 20:57

Borodin

126,100
9
70
144

Wow- thanks for the detailed response. This is helpful to understand where this error is coming from. However, I need to use the whole header as the key, because the fasta file contains many unique subsamples of the same chromosome... any ideas? – jasongallant Dec 04 '12 at 21:01
Well you have to rewrite the keys so that they don't look like the module's abbreviations. The regex the module uses is `/^(.+):([\d_]+)(?:,|-|\.\.)([\d_]+)$/` so you have to avoid matching that. Changing the colon to a hyphen giving `C7136661-0-107` would work, or changing the hyphen to a tilde, giving `C7136661:0~107`. Take your pick. It seems a little short-sighted of the author not to allow for disabling this in the module tbh. It's not non-standard FASTA as you can put anything you like after the `>`. – Borodin Dec 04 '12 at 21:25
@ChrisCharley: That tutorial shows how to read sequences sequentially from a file. The OP is trying to open the file as a database and retrieve sequences from it using the sequence IDs as keys – Borodin Dec 04 '12 at 22:41
Yes - I deleted the comment when I realized that (upon closer reading). – Chris Charley Dec 04 '12 at 22:44
1

@jasongallant: Another thought occurs to me. Since the module's `seq` method uses all the characters in a compound key up to before the *last* colon as an ID, if you always use the abbreviated form with start and end positions you can keep the IDs as they are. What I mean is that if you write `$db->seq('C7136661:0-107:1-10')` you will get the result you expect without any of the key conversion. This would appear as `$db->seq("$id:$start-$end")` in the above code. You would need to comment it carefully though as otherwise it would be a maintenance nightmare. – Borodin Dec 06 '12 at 05:02

score 0 · Answer 2 · answered Feb 04 '14 at 23:11

0

I have related question to this post. I was wondering if anyone has tried what happens when the position in the query is beyond the outside the limit of the fasta position. So lets say, the fasta contains 100 bases and you query contains position 102, does this method trap the error. I tried this in some real data and it appears to always return "1", however, my fasta sequences contains 0/1 and so it is hard to understand if this is an error code/ it is returning the output for the wrong base.

I tried looking in the documentation but could not find anything.

answered Feb 04 '14 at 23:11

user19758

131
1
5

Please make a new question to ask something. You can mention this question in your question. – yǝsʞǝla Feb 04 '14 at 23:31
1

Thanks, I was hoping that someone on this thread might have a suggestion. But posted as new question. – user19758 Feb 04 '14 at 23:46

Extracting DNA sequences from FASTA file with BioPerl with non-standard header

2 Answers2