Parsing huge FASTA file

Question

I have a huge FASTA file and want to extract only the sequences whose header contains Homo sapiens. Approaches that load everything into a dictionary or list would give the result, but the file is too large to hold in memory, so the results have to be written to a file as they are found. A sample of my FASTA file:

gi|489223532|ref|WP_003131952.1| 30S ribosomal protein S18 [Lactococcus lactis] MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDLTRYYDG

gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Homo sapiens] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ

gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4] MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYEDFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKRIEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM

gi|446106212|ref|WP_000184067.1| MULTISPECIES: antibiotic transporter [Homo sapiens] MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR

gi|494110381|ref|WP_007051162.1| MULTISPECIES: argininosuccinate lyase [Bifidobacterium] MTENNEHLALWGGRFTSGPSPELARLSKSTQFDWRLADDDIAGSRAHARALGRAGLLTADELQRMEDALDTLQRHVDDGSFAPIEDDEDEATALERGLIDIAGDELGGKLRAGRSRNDQIACLIRMWLRRHSRVIAGLLLDLVNALIEQSEKAGRTVMPGRTHMQHAQPVLLAHQLMAHAWPLIRDVQRLIDWDKRINASPYGSGALAGNTLGLDPEAVARELGFIDGAD

Expected output

gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Homo sapiens] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ

gi|446106212|ref|WP_000184067.1| MULTISPECIES: antibiotic transporter [Homo sapiens] MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR

Read the file line by line and output it to a file if it contains your String — BlueMoon93, Jan 19 '16 at 15:50
If this is not a school assignment, I'd recommend using existing FASTA parsers to save time. Check BioPython or pyteomics (I wrote the latter) for iterative parsers to build upon. — Lev Levitsky, Jan 19 '16 at 15:52

piman314 · Answer 1 · 2016-01-19T16:56:07.363

1

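You should show some effort in your question; as it stands, it's clear you haven't tried. I'm only answering because it takes just a few lines.

with open('/Users/nfirth/Downloads/file.fasta') as f:
    for line in f:
        if 'Homo sapiens' in line:
            # line keeps its own newline, so print() leaves a blank line after it
            print(line)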
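
Since the question asks for the matches to end up in a file rather than on screen, the same loop can write as it reads. What follows is only a minimal sketch, assuming one record per line as in the sample; `filter_lines` and its parameters are hypothetical names made up for illustration, not part of any library.

```python
def filter_lines(src_path, dst_path, needle="Homo sapiens"):
    """Copy every line containing `needle` from src_path to dst_path.

    Hypothetical helper: only one line is held in memory at a time.
    Returns the number of lines written.
    """
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if needle in line:
                dst.write(line)  # the line already ends with its newline
                kept += 1
    return kept
```

Calling `filter_lines('file.fasta', 'human.fasta')` would then stream through the input once, so memory use should stay flat regardless of file size.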

EDIT

If there is a newline after the header, so that the sequence sits on the lines that follow it, you need a slightly clunkier piece of code, but it will still get through the file quickly.

with open('/Users/nfirth/Downloads/file.fasta') as f:
    keep = False
    for line in f:
        # A '>' header starts a new record: decide whether to keep it.
        if line.startswith('>'):
            keep = 'Homo sapiens' in line
        # Print the header and every sequence line of a kept record.
        if keep:
            print(line, end='')
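
The same idea can be packaged as a generator that groups each header with its sequence lines and writes matches to a file instead of printing them. This is a rough sketch, assuming every record starts with a `>` header line; `iter_records` and `filter_records` are hypothetical names chosen for illustration.

```python
def iter_records(lines):
    """Yield (header, sequence_lines) for each FASTA record in `lines`.

    `lines` is any iterable of text lines, such as an open file, so only
    one record is held in memory at a time.
    """
    header, body = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line, []
        elif header is not None:
            body.append(line)
    # Emit the final record, which has no following header to trigger it.
    if header is not None:
        yield header, body


def filter_records(src_path, dst_path, needle="Homo sapiens"):
    """Write records whose header contains `needle`; return how many."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for header, body in iter_records(src):
            if needle in header:
                dst.write(header)
                dst.writelines(body)
                kept += 1
    return kept
```

Keeping the grouping separate from the filtering means a different test (for example on sequence length) could reuse `iter_records` unchanged.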

edited Jan 19 '16 at 16:56

answered Jan 19 '16 at 16:06


I need to get the sequence as well, not only the header information – sandeep kasaragod Jan 19 '16 at 16:20
In the input provided, there are no newlines at the end of the header information, so the provided code generates the expected output. I will edit my answer; next time, be careful with what you ask. – piman314 Jan 19 '16 at 16:39

score 0 · Answer 2 · answered Feb 12 '16 at 14:30

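from Bio import SeqIO

# Stream one record at a time and write matches immediately,
# so the huge file is never held in memory.
with open("test.fasta", "r") as handle, open("output.fasta", "w") as out:
     for record in SeqIO.parse(handle, 'fasta'):
          # The species is in the full header (record.description);
          # record.name only holds the text before the first space.
          if 'Homo sapiens' in record.description:
               out.write("{0}\n===End===\n".format(str(record.seq)))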
