Extracting sequences in Python

Question

I have a file that looks like this:

>sequence_name_16hj51
CAACCTTGGCCAT
>sequence_name_158ghni52
AATTGGCCTTGGA
>sequence_name_468rth
AAGGTTCCA

I would like to obtain this: ['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']

I have a list with all the sequence names titled title_finder. When I try to use:

for i in range(0,len(title_finder)):
    seq = seq.split(title_finder[i])
    print seq

I get this traceback:

Traceback (most recent call last):
  File "D:/Desktop/Python/consensus new.py", line 23, in <module>
    seq = seq.split(title_finder[i])
AttributeError: 'list' object has no attribute 'split'

Can somebody help me out?

EDIT: Sometimes some sequences span multiple lines and so I get more than one string when I do it with a for loop.

You can only split a string, and you get a list. Your loop splits repeatedly, so you fail after the first go-round. — alexis, Sep 20 '15 at 19:18
use BioPython http://stackoverflow.com/questions/31265282/how-to-randomly-extract-fasta-sequences-using-python/31265485#31265485 — Padraic Cunningham, Sep 20 '15 at 19:54

BioGeek · Answer 1 · 2015-09-20T19:27:08.230

4

If you're doing bioinformatics, you should really consider installing BioPython.

from Bio import SeqIO
with open('your_file.fasta') as f:
    # Collect every record's sequence as a plain string
    sequences = [str(record.seq) for record in SeqIO.parse(f, "fasta")]

If you want to do it in pure Python, then this will work:

with open('your_file.fasta') as f:
    print [line.rstrip() for line in f if not line.startswith('>')]
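
The one-liner above returns one string per line, so a sequence that wraps across several lines (as mentioned in the question's edit) comes back in pieces. Here is a minimal sketch that joins wrapped lines per record. The function name `read_sequences` and the sample data are made up for illustration, and it assumes every record starts with a `>` header line:

```python
def read_sequences(lines):
    """Return one string per FASTA record, joining wrapped sequence lines."""
    sequences = []
    current = None  # parts of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A header closes the previous record and opens a new one.
            if current is not None:
                sequences.append(''.join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        sequences.append(''.join(current))  # flush the last record
    return sequences


# Hypothetical sample: the second record is wrapped over two lines.
sample = [
    '>sequence_name_16hj51\n',
    'CAACCTTGGCCAT\n',
    '>sequence_name_158ghni52\n',
    'AATTGG\n',
    'CCTTGGA\n',
]
print(read_sequences(sample))
```

With a real file you would pass the open file object instead of the list, for example `read_sequences(f)` inside a `with open('your_file.fasta') as f:` block.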

edited Sep 20 '15 at 19:27

answered Sep 20 '15 at 19:19

I second the use of Biopython, it handles FASTA files for you and does a lot of the dirty work. You can always convert it to a string if you really need a string. – Chris Chambers Sep 20 '15 at 21:55

Mazdak · Answer 2 · 2015-09-20T19:37:53.783

1

You are trying to call split on a list, which raises that AttributeError. Instead, you can read your file line by line and keep each line that doesn't start with >.

with open('file_nam') as f:
    my_patterns = [line.rstrip() for line in f if not line.startswith('>')]
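
To see why the loop in the question fails, here is a small sketch with made-up data (`text` and `titles` are hypothetical stand-ins for the asker's file contents and `title_finder`). The first `split` turns the string into a list, so the second pass calls `split` on a list:

```python
text = '>name_a\nCAACC\n>name_b\nAATTG\n'  # hypothetical file contents
titles = ['>name_a\n', '>name_b\n']         # hypothetical header list

seq = text
error = None
try:
    for title in titles:
        # First pass: str.split returns a list.
        # Second pass: seq is now a list, which has no split method.
        seq = seq.split(title)
except AttributeError as exc:
    error = str(exc)

print(type(seq).__name__)
print(error)
```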

As a more Pythonic alternative, if you are sure that the patterns are on every second line (the 2nd, 4th, 6th, ... lines of the file), you can use itertools.islice to slice your file object:

from itertools import islice
with open('file_nam') as f:
    # islice yields lines with their trailing newline, so strip each one
    my_patterns = [line.rstrip() for line in islice(f, 1, None, 2)]

Note that if you just want to loop over your patterns, you don't need to build a list from the result of islice; you can iterate over the iterator directly.
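
As a rough illustration of iterating lazily, the sketch below loops over `islice` directly and never builds a list of lines. It uses an in-memory `StringIO` as a stand-in for the real file, and collecting sequence lengths is only an example task, not something from the question:

```python
from io import StringIO
from itertools import islice

# In-memory stand-in for the file from the question.
fasta = StringIO(
    '>sequence_name_16hj51\n'
    'CAACCTTGGCCAT\n'
    '>sequence_name_158ghni52\n'
    'AATTGGCCTTGGA\n'
    '>sequence_name_468rth\n'
    'AAGGTTCCA\n'
)

lengths = []
# islice yields every second line on demand, starting with the second line.
for line in islice(fasta, 1, None, 2):
    lengths.append(len(line.rstrip()))

print(lengths)
```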

edited Sep 20 '15 at 19:37

answered Sep 20 '15 at 19:19

You'll need to add `rstrip()` after `line`, because now the sequences contain the newline at the end. – BioGeek Sep 20 '15 at 19:31
`islice` includes the `\n` as well – Pynchia Sep 20 '15 at 20:02

score 0 · Answer 3 · answered Sep 20 '15 at 19:24

Assume your file is seq.in; then you can do this to get your list:

In [17]: with open('seq.in', 'r') as f:
   ....:     extracted_list = [line.rstrip('\n') for line in f if not line.startswith('>')]

In [18]: extracted_list
Out[18]: ['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']

score 0 · Answer 4 · answered Sep 20 '15 at 20:02

import re

with open('test') as f:
    # Skip header lines, identified here by the text 'sequence_name'
    lines = [line.rstrip() for line in f if not re.search('sequence_name', line)]

print(lines)

['CAACCTTGGCCAT', 'AATTGGCCTTGGA', 'AAGGTTCCA']

answered Sep 20 '15 at 20:02

LetzerWille
