
I am trying to add more than 70000 new features to a GenBank file using Biopython.

I have this code:

from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation

fi = "myoriginal.gbk"
fo = "mynewfile.gbk"

for result in results:
    start = 0
    end = 0

    result = result.split("\t")
    start = int(result[0])
    end = int(result[1])

    for record in SeqIO.parse(original, "gb"):
        record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))
        SeqIO.write(record, fo, "gb")

`results` is just a list of tab-separated strings, each containing the start and end of one of the features I need to add to the original gbk file.
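Something like this, with made-up coordinates just for illustration:

results = [
    "100\t250",
    "300\t480",
    "512\t890",
    # ... more than 70000 of these in total
]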

This solution is extremely slow on my computer and I do not know how to improve the performance. Any good ideas?

Mastodon
  • What is `results` in your code? Besides that, from what I see, it is very costly to parse `original` on each iteration of the for loop with `SeqIO.parse(original, "gb")`. By `original`, do you mean the `fi` variable? – cnluzon Jul 22 '15 at 10:45

1 Answer


You should parse the GenBank file just once. Setting aside what `results` contains (I do not know exactly, because there are some missing pieces of code in your example), I would guess that something like this, modifying your code, would improve performance:

fi = "myoriginal.gbk"
fo = "mynewfile.gbk"

original_records = list(SeqIO.parse(fi, "gb"))

for result in results:
    result = result.split("\t")
    start = int(result[0])
    end = int(result[1])

    for record in original_records:
        record.features.append(SeqFeature(FeatureLocation(start, end), type = "misc_feat"))
        SeqIO.write(record, fo, "gb")
cnluzon