4

I have a fasta file like this: myfasta.fasta

>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAATTATTA
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA

I have a code that I want to use to separate sequences based on their ids that have matching patterns like 'CDS', 'tRNA' etc. In the code below, I am trying to use startswith and also match pattern in line which doesn't seem to work. Can someone please help me how to look for two conditions in line in python.

code: python mycode.py myfasta.fasta

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta)

for line in fasta:
    if line.startswith('>') and 'CDS' in line:
        print(line)
    else:
        print(line)

Expected output (if I use CDS):

>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAATTATTA
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
MAPK
  • 5,635
  • 4
  • 37
  • 88
  • What is your expected output? – Dalvenjia Mar 27 '19 at 21:20
  • 1
    Mmmhh, seems like you're printing `line` anyway... How do you differentiate between positives and negatives? – Jacques Gaudin Mar 27 '19 at 21:21
  • seems like it works correctly. change the second print to something meaningful that could differentiate between having CDS and not having that. – maanijou Mar 27 '19 at 21:23
  • 1
    yeah...code seems to do what you think, except that you're going to get blank lines because you have EOL at the end of each of your lines, and then print() is adding another one. – CryptoFool Mar 27 '19 at 21:25
  • 1
    You're still just printing every line unconditionally!!! You have to do something different depending on if your test succeeds or fails, or what's the point? – CryptoFool Mar 27 '19 at 21:26
  • I do see the logic problem you're going to have. You're going to need to keep track of if you've just matched a '>' line or not to decide if you should print a line without a '>'. That's going to take a little thought, and probably a state variable. – CryptoFool Mar 27 '19 at 21:28
  • 1
    You have to treat lines with '>' at the front differently than those without, regardless of if they match 'CDS' or not. When you see '>' at the front of the line, you need to check to see if 'CDS' is in it to know if you should print it. But you ALSO have to set a flag, so that the next time you see a line without a '>' at the front, you'll know if you should print that line or not. That's the secret hear. Create a boolean variable to keep track of if you printed the last '>' line you saw or not, and then use that variable to decide what to do with non-'>' lines – CryptoFool Mar 27 '19 at 21:30
  • You misunderstand me. I'm telling you what you have to do to get the output you want - you have to be able to print some of your lines even if they don't match your test, because you want to print or not print lines without '>' at the front, based on the prior line. – CryptoFool Mar 27 '19 at 21:31
  • do you see what I'm getting at? If so, how about trying to add some .more logic and show us what you've tried next. HINT: take ` and 'CDS' in line` off of your `if` line. – CryptoFool Mar 27 '19 at 21:37
  • I'm trying to tell you. You want to always process the '>' lines in one code block, even if you won't always print them. You then have to decide in that block if you should print the line or not, and here's the key...you need to REMEMBER if you printed the line or not. Then, you do want your 'else:' block, but that block should use the fact that you remembered what you did on the prior line to decide if that line (a line without a '>') should be printed. – CryptoFool Mar 27 '19 at 21:41
  • No problem. This is how you learn! – CryptoFool Mar 27 '19 at 21:46
  • 2
    easiest with Biopython: `from Bio import SeqIO; SeqIO.write((r for r in SeqIO.parse('in.fa', 'fasta') if 'CDS' in r.id), 'out.fa', 'fasta')` – Chris_Rands Mar 28 '19 at 14:00

3 Answers3

4

Here is a code that works for you. If a line has CDS, it prints the line and the next lines. strip() removes the endline character while printing the line.

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
    for line in fasta:
        if line.startswith('>') and 'CDS' in line:
            flag = True
        elif line.startswith('>'):
            flag = False
        if flag:
            print(line.strip())

Edit: You can remove the elif part as the following code:

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]

flag = False
with open(myfasta) as fasta:
    for line in fasta:
        if line.startswith('>'):
            flag = 'CDS' in line
        if flag:
            print(line.strip())
maanijou
  • 1,067
  • 1
  • 14
  • 23
1

Maanijou's answer is fine.

Also, consider an alternative with a iterator instead.

EDIT: Updated the code based on your comments

#!/usr/bin/env python
import sys
import os
myfasta = sys.argv[1]
fasta = open(myfasta, "r+")

file_contents = iter(fasta)

try:
    print_flag = True
    while True:
        line = file_contents.next()
        if line.startswith('>'):
            if "CDS" in line:
                print (line.strip())
                print_flag = True
            else:
                print_flag = False
        else:
            if print_flag:
                print (line.strip())

except StopIteration:
    print ("Done")
    fasta.close()

Explanation

file_contents = iter(fasta) converts the iterable file object into an iterator on which you can simply keep calling next() till you run out of things to read

Why I do not recommend calling readlines as some other answers have is that sometimes fasta files can be big and calling readlines consumes significant memory.

if a line satisfies your search req you simply print it and the next line, if not you simply read the next line and do nothing,

Explanation for Update

  1. You got the Attribute error because of file modes, I could not reproduce it locally but I think opening the file with the right mode should fix it
  2. You now said there could be more than 1 genome sequence for CDS updated the code to print all the genome sequences for 1 CDS header in the file

I tested it with a modified fasta file like so

>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
>5_rRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>6_tRNA
TTAAAAATTTCTGGGCCCCGGGAAAAAA

And this output

python fasta.py fasta.fasta
>1_CDS
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGGG
AAAAATTTCTGGGCCCCGGGCG
>2_CDS
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>3_CDS
TTTGGGAATTAAACCCT
>4_CDS
TTTGGGAATTAAACCCT
Done
Srini
  • 1,619
  • 1
  • 19
  • 34
  • getting this error: `AttributeError: '_io.TextIOWrapper' object has no attribute 'next'` – MAPK Mar 27 '19 at 22:01
  • I think the error was because of file modes, I could not reproduce it locally, but I have added a fix which I think will fix it for you – Srini Mar 27 '19 at 22:08
0

Is this what you want?

#!/usr/bin/env python
import sys
import os
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

pattern_data = defaultdict(list)
index = 0
while index < len(data):
    if data[index].startswith('>'):
        start = data[index].index('_') + 1
        key = data[index][start:]
        pattern_data[key].append(data[index + 1])
    index += 2

At this point you are free to do whatever you please with the sorted data.

The above assumes that the whole file you parse follows the exact format shown above: 1 line starting with a ">" that id's the single line that follows. If you have multiple lines that follow, the code needs minor modification.

EDIT: I just read up on fasta files. I now know that they actually may have sequences that are longer than one line after they are identified. So the above code does need to be modified to account for multiline sequences. A more generalized approach is as follows:

#!/usr/bin/env python
import sys
import os
from collections import defaultdict

myfasta = sys.argv[1]
with open(myfasta) as fasta:
    data = fasta.read().splitlines()

id_line_indices = [index for index, line in enumerate(data) if line.startswith('>')]
id_line_indices.append(len(data))
pattern_buckets = defaultdict(list)

i = 0
while i < len(id_line_indices) - 1:
    start = data[id_line_indices[i]].index('_') + 1
    key = data[id_line_indices[i]][start:]

    sequence = [data[index] for index in range(id_line_indices[i] + 1, id_line_indices[i + 1])]
    sequence = ''.join(sequence)

    pattern_buckets[key].append(sequence)
    i += 1

This still achieves the same results for the above data set. For example,

print(pattern_buckets['CDS'])
print(pattern_buckets['rRNA'])

Will get you:

['AAAAATTTCTGGGCCCCGGGGG', 'TTAAAAATTTCTGGGCCCCGGGAAAAAA', 'TTTGGGAATTAAACCCT', 'TTTGGGAATTAAACCCT']
['TTAAAAATTTCTGGGCCCCGGGAAAAAA']
Perplexabot
  • 1,852
  • 3
  • 19
  • 22