I'm trying to organize a file with multiple sequences. In doing so, I'm trying to add the names to a list and add the sequences to a separate list that is parallel with the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequences that follow each name as separate entries in the second list. I tried appending the lines of sequence to an empty string, but that put all the lines of all the sequences into a single string.

All the names start with a '>'.

def Name_Organizer(FASTA,output):

    import os
    import re

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]

    for line in in_file:

        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line

    print data

How do I add the sequences to a list as a set of strings?

The input file looks like this:

[image: sample input file, not reproduced here]

O.rka

3 Answers

If you're working with Python and FASTA files, you might want to look into installing Biopython. It already contains this parsing functionality, and a whole lot more.

Parsing a fasta file would be as simple as this:

from Bio import SeqIO
for record in SeqIO.parse('filename.fasta', 'fasta'):
    print(record.id, record.seq)
Tim

You need to reset the string when you hit marker lines, like this:

def Name_Organizer(FASTA,output):

    import os
    import re

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]
    seq_list=[]

    for line in in_file:

        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                if data:
                    seq_list.append(data)
                    data=''
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line

    # the last sequence has no marker line after it, so store it here
    if data:
        seq_list.append(data)

    print(seq_list)

Of course, it might also be faster (depending on how large your files are) to use string joining rather than continually appending:

data = []

# ...

data.append(line) # repeatedly

# ...

seq_list.append(''.join(data)) # each time you get to a new marker line
data = []
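
Putting the two ideas together, a minimal sketch of the join-based version might look like the following. The function name `parse_fasta` and the sample records are made up for illustration. It accepts any iterable of lines (an open file works) and checks for marker lines with `startswith('>')` instead of looping over characters.

```python
def parse_fasta(lines):
    """Return (names, seqs) as two parallel lists.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    names = []
    seqs = []
    chunks = []  # pieces of the sequence currently being read

    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # a new record begins: store the previous sequence, if any
            if names:
                seqs.append(''.join(chunks))
            chunks = []
            names.append(line)
        elif names:
            # sequence line; anything before the first marker is ignored
            chunks.append(line.upper())

    # the last record has no marker after it, so store it here
    if names:
        seqs.append(''.join(chunks))

    return names, seqs


# hypothetical sample data
sample = [">seq1", "acgt", "ttga", ">seq2", "GGCC"]
names, seqs = parse_fasta(sample)
print(names)
print(seqs)
```

Because a sequence is stored once per name, `names[i]` and `seqs[i]` always refer to the same record, even if a record has no sequence lines.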
Amber
  • It works! I'm just confused by the line "if data:". How can the name of a string be an if statement? – O.rka Mar 04 '12 at 18:53
  • In Python, an empty string is a false value, and a non-empty string is a true value. Thus `if data:` is equivalent to "if data is not empty" – Amber Mar 04 '12 at 18:56
  • @draconisthe0ry, Amber, I feel I should mention that there's something strange about iterating over every character of every line like this. Isn't that unnecessary? Am I missing something? – senderle Mar 04 '12 at 19:01
  • It is unnecessary, I was just trying to modify the OP's code as little as possible outside the bounds of the question. In general, `if line.startswith('>')` would be a better check. – Amber Mar 04 '12 at 19:04
  • Also, @draconisthe0ry, in the above code, `line = line.upper()` is performed for _every character_. Did you mean the `line = line.upper()` to appear in the `else` clause of the `for` loop? – senderle Mar 04 '12 at 19:11
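
To illustrate the point about `if data:` from the comments above, here is a small check that can be run in an interpreter. The sample values are arbitrary. The same rule applies to lists, which matters if `data` is kept as a list of lines rather than a string.

```python
# empty strings and empty lists are false; non-empty ones are true
samples = ['', 'ACGT', [], ['ACGT']]

for value in samples:
    if value:
        print(repr(value), 'is treated as true')
    else:
        print(repr(value), 'is treated as false')

# bool() makes the same decision explicit
truthy = [bool(value) for value in samples]
```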

I organized it in a dictionary first

import sys

# read the file and strip whitespace from each line
with open(sys.argv[1]) as handle:
    lines = [x.strip() for x in handle]
fasta = {}
for line in lines:
    if not line:
        continue
    # create the sequence name in the dict and a variable
    if line.startswith('>'):
        sname = line
        if line not in fasta:
            fasta[line] = ''
        continue
    # add the sequence to the last sequence name variable
    fasta[sname] += line
# just to facilitate the input for my function
lst = list(fasta.values())
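
If the two parallel lists from the question are still wanted, they can be pulled out of a dictionary like the one built above. This is only a sketch: the dictionary contents here are invented, and it assumes Python 3.7 or later, where dictionaries keep insertion order, so the names come out in file order.

```python
# hypothetical dictionary, shaped like the `fasta` dict built above
fasta = {'>seq1': 'ACGTTTGA', '>seq2': 'GGCC'}

# keys() and values() iterate in the same order, so the lists line up
name_list = list(fasta.keys())
seq_list = list(fasta.values())

for name, seq in zip(name_list, seq_list):
    print(name, len(seq))
```

One thing to keep in mind with the dictionary approach: if the same name appears twice in the file, both sequences end up concatenated under a single key.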