How to read a FASTA file in Python?

Question

I'm trying to read a FASTA file, find a specific motif (string), and print out each sequence together with the number of times the motif occurs in it. A FASTA file is just a series of sequences (strings); each one starts with a header line, and the signature for a header, i.e. the start of a new sequence, is ">". The sequence of letters follows on a new line immediately after the header. I'm not done with the code, but so far I have this, and it gives me this error:

AttributeError: 'str' object has no attribute 'next'

I'm not sure what's wrong here.

import re

header=""
counts=0
newline=""

f1=open('fpprotein_fasta(2).txt','r')
f2=open('motifs.xls','w')
for line in f1:
    if line.startswith('>'):
        header=line
        #print header
        nextline=line.next()
        for i in nextline:
            motif="ML[A-Z][A-Z][IV]R"
            if re.findall(motif,nextline):
                counts+=1
                #print (header+'\t'+counts+'\t'+motif+'\n')
        fout.write(header+'\t'+counts+'\t'+motif+'\n')

f1.close()
f2.close()

Is that an assignment for education or is it for work? Because there are multiple libraries available that already do this. — Lev Levitsky, Dec 14 '13 at 07:40

score 6 · Answer 1 · answered Dec 14 '13 at 07:23

The error is likely coming from the line:

nextline=line.next()

line is the string you have already read; a str has no next() method.

Part of the problem is that you're trying to mix two different ways of reading the file: iterating over the lines with for line in f1, and calling <handle>.next().
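If the goal is to grab the line that follows a header, next() has to be called on the file handle rather than on the string. A minimal sketch of that idea, assuming each sequence sits on a single line right after its header; io.StringIO stands in for the opened file, and the function name and sample records are made up for illustration:

```python
import io
import re

MOTIF = re.compile(r"ML[A-Z][A-Z][IV]R")

def count_motif_single_line(handle):
    """Yield (header, count) pairs, assuming one sequence line per header."""
    for line in handle:
        if line.startswith(">"):
            header = line.rstrip("\n")
            # next() on the file handle (not on the str) pulls the following line;
            # the default "" avoids StopIteration if the header is the last line.
            sequence = next(handle, "").strip()
            yield header, len(MOTIF.findall(sequence))

# Hypothetical two-record input standing in for an open FASTA file.
fake_file = io.StringIO(">seq1\nAAMLKKIRAA\n>seq2\nGGGG\n")
results = list(count_motif_single_line(fake_file))
```

The built-in next(handle) is used here instead of handle.next() because the method form only exists in Python 2.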

Also, if you are working with FASTA files I recommend using Biopython: it makes working with collections of sequences much easier. Chapter 14, on motifs, will be of particular interest to you. This will likely require that you learn more about Python in order to achieve what you want, but if you're going to be doing a lot more bioinformatics than your example here shows, then it's definitely worth the investment of time.

Thanks! Yes, I'm going to use Biopython. However, for this assignment I have to use plain Python. What I'm trying to do is read the line that follows the header; that's why I used .next(), but that's obviously wrong! — user3098683, Dec 14 '13 at 08:32

Arnaud P · Answer 2 · 2013-12-14T07:44:23.897

This might help get you going in the right direction:

import re

def parse(fasta, outfile):
    motif = "ML[A-Z][A-Z][IV]R"
    header = None
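    with open(fasta, 'r') as fin, open(outfile, 'w') as fout:
        for line in fin:
            if line.startswith('>'):
                # A new record starts: write out the previous one first.
                if header is not None:
                    fout.write(header + '\t' + str(count) + '\t' + motif + '\n')
                header = line.rstrip('\n')  # drop the newline so each output row stays on one line
                count = 0
            else:
                matches = re.findall(motif, line)
                count += len(matches)
        # Write the last record, which no following header would flush.
        if header is not None:
            fout.write(header + '\t' + str(count) + '\t' + motif + '\n')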
if __name__ == '__main__':
    parse("fpprotein_fasta(2).txt", "motifs.xls")
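One caveat with searching line by line: FASTA sequences are often wrapped over several lines, and a motif that straddles a line break would be missed. A possible variation, sketched here under that assumption, collects each record's lines and searches the joined sequence. The function name iter_fasta and the sample data are illustrative and not part of the answer above:

```python
import io
import re

def iter_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)

motif = re.compile(r"ML[A-Z][A-Z][IV]R")
# Hypothetical record whose motif MLKKIR is split across two lines.
sample = io.StringIO(">seq1\nAAMLK\nKIRAA\n>seq2\nGGGG\n")
counts = {header: len(motif.findall(seq)) for header, seq in iter_fasta(sample)}
```

Note that findall counts non-overlapping matches only, which is the same behaviour as in the code above.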

score 0 · Answer 3 · answered Dec 14 '13 at 07:15

I am not sure about the FASTA stuff, but I am pretty sure something is wrong here:

nextline=line.next()

line is simply a str, so you can't call next() on it.

Also, regarding files, it is recommended to use:

with open('fpprotein_fasta(2).txt','r') as f1:

This will deal with closing the file automatically.
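A small self-contained illustration of that behaviour; the temporary file and its contents are made up so the snippet can run anywhere:

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on real data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(">seq1\nMLKKIR\n")
    path = tmp.name

with open(path, "r") as f1:
    lines = f1.readlines()

# Once the with block ends the file is closed, even if an error was raised inside it.
was_closed = f1.closed
os.remove(path)
```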

Please provide a sample FASTA file so that I can try to correct the code.

Ray

>gi|951040|emb|CAA62241.1| alcohol dehydrogenase [Rattus norvegicus] MGTQGKVITCKAAIAWKTDSPLCIEEIEVSPPKAHEVRIKVIATCVCPTDINATNPKKKALFPVVLGHECAGIVESVGPGVTNFKPGDKVIPFFAPQCKKCKLCLSPLTNLCGKLRNFKYPTIDQELMEDRTSRFTSKERSIYHFMGVSSFSQYTVVSEANLARVDDEANLERVCLIGCGFTSGYGAAINTAKVTPGSACAVFGLGCVGL – user3098683 Dec 14 '13 at 08:33
So this is an example of the format, with the header being >gi|951040|blah blah blah, and the next line is what I'm trying to find the motif in. – user3098683 Dec 14 '13 at 08:35

score 0 · Answer 4 · answered Jun 29 '22 at 18:36

This is how I load a FASTA file into a dictionary:

motifs = dict()

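# Raw string, so the backslash in the path is not read as an escape sequence (\f).
with open(r'[path to FASTA file]\filename.fna') as f:
    lines = f.readlines()

key = None
for i in range(0, len(lines)):
    s = lines[i].strip()
    if not s:
        continue  # skip blank lines; s[0] would raise IndexError on them
    if s[0] == '>':
        key = s[1:]
        motifs[key] = ''
    elif key is not None:
        motifs[key] += s  # append, so sequences wrapped over several lines are kept whole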

Each line starting with the '>' character contains the ID (key) of the sequence that follows it.
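With the records in a dictionary, counting the motif from the question takes one more step. A rough sketch, assuming a dictionary of the same shape (record ID mapped to sequence); the two entries are invented sample data:

```python
import re

# Hypothetical dictionary in the shape built above: record ID -> sequence.
motifs = {
    "seq1": "AAMLKKIRAAMLAAVR",
    "seq2": "GGGG",
}

pattern = re.compile(r"ML[A-Z][A-Z][IV]R")
# findall returns the non-overlapping matches, so len() gives the count per record.
counts = {key: len(pattern.findall(seq)) for key, seq in motifs.items()}
```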
