1

I have a file that looks like this:

>sequence_name_16hj51
CAACCTTGGCCAT
>sequence_name_158ghni52
AATTGGCCTTGGA
>sequence_name_468rth
AAGGTTCCA

I would like to obtain this: ['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']

I have a list with all the sequence names titled title_finder. When I try to use:

for i in range(0,len(title_finder)):
    seq = seq.split(title_finder[i])
    print seq

I get this traceback:

Traceback (most recent call last):
  File "D:/Desktop/Python/consensus new.py", line 23, in <module>
    seq = seq.split(title_finder[i])
AttributeError: 'list' object has no attribute 'split'

Can somebody help me out?

EDIT: Sometimes some sequences span multiple lines and so I get more than one string when I do it with a for loop.

BioGeek
LilyJones
  • You can only split a string, and you get a list. Your loop splits repeatedly, so you fail after the first go-round. – alexis Sep 20 '15 at 19:18
  • use BioPython http://stackoverflow.com/questions/31265282/how-to-randomly-extract-fasta-sequences-using-python/31265485#31265485 – Padraic Cunningham Sep 20 '15 at 19:54
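
To illustrate the first comment, here is a minimal sketch of why the loop fails on its second pass. The string and titles are made-up stand-ins for the asker's data, not taken from the real file.

```python
# Hypothetical stand-ins for the file contents and the title list.
text = "name1CAACCname2AATT"
titles = ["name1", "name2"]

# str.split returns a list, so after one pass `parts` is no longer a string.
parts = text.split(titles[0])
print(type(parts).__name__)

# Calling .split on that list raises the AttributeError from the traceback.
try:
    parts.split(titles[1])
except AttributeError as error:
    message = str(error)
print(message)
```

Reassigning the result of split back to the same name is what turns the string into a list after the first iteration.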

4 Answers

4

If you're doing bioinformatics, you should really consider installing BioPython.

from Bio import SeqIO

# SeqIO.parse yields one record per FASTA entry, joining wrapped lines.
with open('your_file.fasta') as f:
    sequences = [str(record.seq) for record in SeqIO.parse(f, "fasta")]

If you want to do it in pure Python, then this will work as long as each sequence occupies a single line:

with open('your_file.fasta') as f:
    # Keep every line that is not a header line.
    print([line.rstrip() for line in f if not line.startswith('>')])
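
The question's EDIT mentions sequences that span several lines. One possible pure-Python sketch for that case is below. The function name read_fasta_sequences and the sample data are illustrative assumptions, and this is not a full FASTA parser.

```python
def read_fasta_sequences(lines):
    """Return one string per FASTA record, joining wrapped sequence lines."""
    sequences = []
    chunks = None  # pieces of the record being read; None until the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if chunks is not None:
                sequences.append(''.join(chunks))
            chunks = []
        elif chunks is not None:
            chunks.append(line)  # text before the first header is ignored
    if chunks is not None:
        sequences.append(''.join(chunks))
    return sequences


# Made-up sample in which the first sequence is wrapped over two lines.
example = """>sequence_name_16hj51
CAACCTT
GGCCAT
>sequence_name_158ghni52
AATTGGCCTTGGA
""".splitlines()

print(read_fasta_sequences(example))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly in place of the sample list.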
BioGeek
  • I second the use of Biopython, it handles FASTA files for you and does a lot of the dirty work. You can always convert it to a string if you really need a string. – Chris Chambers Sep 20 '15 at 21:55
1

You are trying to split a list, which is what raises that AttributeError. Instead, you can read the file line by line and keep each line that does not start with >.

with open('file_name') as f:
    # Keep only the lines that are not headers.
    my_patterns = [line.rstrip() for line in f if not line.startswith('>')]

As an alternative, Pythonic approach, if you are sure that the patterns are on the odd-indexed lines (counting from zero), you can use itertools.islice to slice the file object:

from itertools import islice

with open('file_name') as f:
    # Take every second line, starting at index 1.
    # Each item keeps its trailing newline.
    my_patterns = list(islice(f, 1, None, 2))

Note that if you only want to loop over the patterns, you don't need to convert the result of islice to a list; you can iterate over the iterator directly.
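
As a rough sketch of that last point, the loop below consumes islice directly without building a list first. An in-memory StringIO with made-up headers stands in for the real file.

```python
from io import StringIO
from itertools import islice

# Stand-in for an open FASTA file with one sequence line per record.
fasta = StringIO(">a\nCAACCTTGGCCAT\n>b\nAATTGGCCTTGGA\n>c\nAAGGTTCCA\n")

lengths = []
# islice yields lines lazily, so no intermediate list is created.
for line in islice(fasta, 1, None, 2):
    lengths.append(len(line.rstrip()))  # rstrip drops the trailing newline

print(lengths)  # [13, 13, 9]
```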

Mazdak
0

Assume your file is seq.in; then you can do this to get your list:

In [17]: with open('seq.in', 'r') as f:
          # rstrip('\n') is safe even if the last line has no trailing newline
          extracted_list = [line.rstrip('\n') for line in f if line[0] != '>']

In [18]: extracted_list
Out[18]: ['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']
Iman Mirzadeh
0

import re

with open('test') as f:
    # Skip any line that contains the header marker 'sequence_name'.
    lines = [line.rstrip() for line in f if not re.search('sequence_name', line)]

print(lines)

['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']

LetzerWille