0

I wrote a script that uses a GenBank file and Biopython to fetch the sequences of given genes from the sequence part of the GBK file, which my colleagues use for their work.

We had some problems now with a new data set, and it turned out that the GBK file that was downloaded did not contain a sequence (which can easily happen when you download from the GenBank website at NCBI). Instead of throwing an error, Biopython returns a long sequence of Ns when using record.seq[start:end]. What is the easiest way to catch that problem right from the start to stop the script with an error message?

MattDMo
  • 100,794
  • 21
  • 241
  • 231
Lilith-Elina
  • 1,613
  • 4
  • 20
  • 31
  • 1
    Please [edit] your question with an [MCVE](http://stackoverflow.com/help/mcve) showing what you are doing now. Please include 2 inputs - a normal GenBank file (containing a sequence) and one that does not contain a sequence. Without knowing exactly what you're doing, it's difficult to give advice on how to change it. – MattDMo Aug 28 '14 at 15:54
  • Sorry, I thought people who have worked with GenBank and BioPython before wouldn't need input files. I should have included code, of course. By now, I've found my own solution, so I hope you don't mind if I don't try to find a way to provide files on here. Anyway, thanks, the solution was a lot easier to find while trying to prepare a MCVE. – Lilith-Elina Sep 01 '14 at 11:14

1 Answers1

1

Right, I found a way. If I count the Ns in the sequence and check if there are as many as the sequence is long, I know that the sequence is missing:

import sys
from Bio import SeqIO    

for seq_record in SeqIO.parse("sequence.gb", "genbank"):
  sequence = seq_record.seq
  if len(sequence) == sequence.count("N"):
    sys.exit("There seems to be no sequence in your GenBank file!")

I would have preferred a solution that checks the sequence type instead, since the empty sequence is Bio.Seq.UnknownSeq, instead of Bio.Seq.Seq for a real sequence, and would be thankful if anyone can suggest something in that direction.

Update

@xbello made me try again to check the sequence type, now this also works:

import sys, Bio
from Bio import SeqIO    

for seq_record in SeqIO.parse("sequence.gb", "genbank"):
  sequence = seq_record.seq
  if isinstance(sequence, Bio.Seq.UnknownSeq):
    sys.exit("There seems to be no sequence in your GenBank file!")
Lilith-Elina
  • 1,613
  • 4
  • 20
  • 31
  • When you say "no sequence" or "sequence missing" do you mean lots of "NNNNNNNNN" or just an empty sequence? Could you give some sample accession? If you just want to check if a Seq is a `Bio.Seq.UnknownSeq` you can use something like `if isinstance(sequence, Bio.Seq.UnknownSeq)`. – xbello Sep 02 '14 at 08:59
  • If you go to GenBank to download something (e.g. http://www.ncbi.nlm.nih.gov/nuccore/U22660.1) and un-check the "Show sequence" box under Customize view on the right before doing so, the file will not contain anything below the ORIGIN feature. Nevertheless, if you read that file with `SeqIO.parse()` and then try to extract the sequence, Biopython will give you as many Ns as the sequence is supposed to be long. – Lilith-Elina Sep 02 '14 at 09:10
  • So far, `isinstance()` didn't work for me. It seems that was because I always only did `from Bio import SeqIO` and never actually imported `Bio` as such. Now it works, thanks. :-) – Lilith-Elina Sep 02 '14 at 09:11
  • It's better to import UnknownSeq with `from Bio.Seq import UnknownSeq`, and then use only `isinstance(sequence, UnknownSeq)`. – xbello Sep 02 '14 at 12:41