Apply a formula to a line of data where it crosses multiple lines

Question

I'm just making my first steps to try to learn a bit of Python. Currently working my way through the Rosalind online course which aims to teach bioinformatics python skills. (very good by the way, see: rosalind.info)

I am struggling with one particular problem. I have a file in FASTA format which has the form thus:

>Sequence_Header_1
ACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGT
>Sequence_Header_2
ACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGT

I need to calculate the percentage of G and C in each entry of the file (excluding the headers) and return this number, example:

>Sequence_Header_1
48.75%
>Sequence_header_2
52.43%

My code so far is:

file = open("input.txt" , "r")
for line in file:
    if line.startswith(">"):
        print(line.rstrip())        
    else:
        print ('%3.2f' % (line.count('G')+line.count('C')/len(line)*100))
file.close()

Which is doing almost what I need it to do. I am just having trouble where the sequence data crosses multiple lines. At the moment I get the % GC content for every line in the file rather than returning a single figure for each entry, example:

>Sequence_Header_1
48.75%
52.65%
>Sequence_header_2
52.43%
50.25%

How can I apply my formula to the data which crosses multiple lines?

Thanks in advance,

score 1 · Answer 1 · answered Jul 11 '13 at 15:56

1

Not really a direct answer to your question, but I think it is a better way to go! If you are planning on doing more bioinformatics in python, have a look at biopython. It will handle fasta files and other common sequence manipulations (and a whole lot more!) for you.

so for example:

from Bio import SeqIO
from Bio.SeqUtils import GC

for rec in SeqIO.parse("input.txt", "fasta"):
    print rec.id,GC(rec.seq)

answered Jul 11 '13 at 15:56

Stylize

1,058
5
16
32

Thanks Stylize, I'll definitely give biopython a look when I'm doing this stuff "for real". – mu0u506a Jul 15 '13 at 09:37

joente · Answer 2 · 2013-07-10T14:58:54.473

I think this is what you are looking for:

# Read the input file
with open("input.txt", "r") as f:
    s = f.read()

# Loop through the Sequences
for i, b in enumerate(s.split("Sequence_Header_")):
    if not b: continue # At least the first one is empty 
                       # because there is no data before the first sequence
    # Join the lines with data
    data = "".join(b.split("\n")[1:])

    # Print the result
    print("Sequence_Header_{i}\n{p:.2f}%".format(
        i=i, p=(data.count('G')+data.count('C'))/len(data)*100))

Note: I can't find the '>' sign in your example. If your header is starting with > then you can re-write the code to s.split(">") and the code should still be fine.

Hi Joente, my mistake, the headers should indeed be starting with a ">". Thanks for your input. There's a lot of stuff in there that is completely new to me. Going to try to get my simple-noob version working first! — mu0u506a, Jul 11 '13 at 09:44

seth · Answer 3 · 2013-07-10T15:15:15.663

0

Try keeping a running tally, and then having this tally reset when a new header is found.

count = 0.0
line_length=0.0
seen_header = False
for line in open("input.txt" , "r"): #saves memory.
    if line.startswith('>'):
        if not seen_header:
            header = line.strip()
            seen_header=True
        if line_length > 0.0:
            print header,'\n',('%3.2f' % (100.0*count/line_length))
            count = 0.0
            line_length=0.0
            seen_header = False
    else:
        count += line.count('C') + line.count('C')
        line_length += len(line.strip())
print header,'\n',('%3.2f' % (100.0*count/line_length))

Also be careful about division in python, remember that the default is integer division. i.e. 5/2 = 2. You can avoid this by using decimals in your variables, or float().

edit: made nicer, also should have been line_length+=len(line.strip()), to avoid counting the newline "\n" as two characters.

edited Jul 10 '13 at 15:15

answered Jul 10 '13 at 14:48

seth

1,778
16
17

Hi seth, I took your way of working as inspiration and tried to make my own version of it. I was **very** confused about why the len() function appeared to be adding a couple of extra characters until I saw your comment regarding the newlines, so Thanks!
The calculations now seem to be working correctly but for some reason my print statement is producing a leading 'space' character before my header. Any thoughts?
`print ((gc_total/seq_length)*100, '\n', header)` – mu0u506a Jul 11 '13 at 09:50
python's print automatically puts spaces between its arguments, but that isn't what is happening here, '\n' should result in header printing without a leading space. I suspect your header is defined by line.rstrip(), i.e. right strip. Try using .strip() instead. – seth Jul 12 '13 at 01:10

score 0 · Answer 4 · answered Jul 10 '13 at 14:48

It might not be possible to hold the entire file in memory. Assuming you cannot do s = f.read() all at once, you need to keep a running count of the number of letters and total letters until a new sequence starts. Something like this:

file = open("input.txt" , "r")
# for keeping count:
char_count = 0
char_total = 0
for line in file:
    if line.startswith(">"):
        if char_total != 0:
            # starting a new sequence, so calculate pct for last sequence
            char_pct = (char_count / char_total) * 100
            print ('%3.2f' % char_pct)
            # reset the count for the next sequence
            char_total = 0
            char_count = 0
        print(line.rstrip())        
    else:
        # continuing a sequence, so add to running counts
        char_count += line.count('G') + line.count('C')
        char_total += len(line)
file.close()

Thanks a lot for your help ChrisP. I took yours and seth's suggestions and tried to code them up myself. Almost there now I think! Everyone's help and super quick responses are very much appreciated. — mu0u506a, Jul 11 '13 at 09:51

score 0 · Answer 5 · answered Feb 22 '14 at 01:40

You can parse the fasta format and create a dictionary with the >ID as keys and the sequences as values, like so:

    from collections import defaultdict

    def parse_fasta(dataset):
        "Parses data in FASTA format, returning a dictionary of ID's and values"
        records = defaultdict(str)
        record_id = None
        for line in [l.strip() for l in dataset.splitlines()]:
            if line.startswith('>'):
                record_id = line[1:]
            else:
                records[record_id] += line
        return records

or you can rewrite this code a little and create a tuple / list. I prefer the dictionary, since it is already indexed. You can find me on the Rosalind site, if you still need help.

Apply a formula to a line of data where it crosses multiple lines

5 Answers5