0

given a FASTA text file (Rosalind_gc.txt), I am supposed to go through each DNA record and identify the percentage (%) of Guanine-Cytosine (GC) content.

Example of this is :

Sample Dataset:

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG    
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample output:

Rosalind_0808 60.919540

So basically go through each string, count the amt of times G/C show up and then divide that total by the length of each string. My issue is learning how to identify the breaks in code (i.e. >Rosalind_6404 ). I would like an example of this code without using Biopython and also with the biopython approach.

Alain T.
  • 40,517
  • 4
  • 31
  • 51
  • 1
    I think there are already some tools developed to read fasta files, is there a particular case you want to write it on your own? WGS data can be _large_. They were typically implemented in C. – knh190 May 30 '19 at 21:21

3 Answers3

2

You could read the file line by line and accumulate sequence data up to the next line that starts with ">" (plus one more time for the end of the file)

def getCount(seq):
    return seq.count("G")+seq.count("C") 

with open("input.txt","r") as file:
    sequence = ""
    name     = ""
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            sequence += line
            continue
        if name != "":
            print(name, 100*getCount(sequence)/len(sequence))
        name     = line[1:]
        sequence = ""
    print(name, 100*getCount(sequence)/len(sequence))

# Rosalind_6404 53.75
# Rosalind_5959 53.57142857142857
# Rosalind_0808 60.91954022988506
Alain T.
  • 40,517
  • 4
  • 31
  • 51
2

Since you're looking for a Biopython solution, here is a very simple one:

from Bio import SeqIO
from Bio.SeqUtils import GC

for r in SeqIO.parse('Rosalind_gc.fa', 'fasta'):
    print(r.id, GC(r.seq))

Outputs:

Rosalind_6404 53.75
Rosalind_5959 53.57142857142857
Rosalind_0808 60.91954022988506
Chris_Rands
  • 38,994
  • 14
  • 83
  • 119
1

You may want to make use of precompiled C modules as much as possible for performance issue. There's one solution using regex:

seq = 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'

import re
perc = re.subn(r'[GC]', '', seq) / len(seq)

And also handle the ">" lines:

seq = []
name = ''

for line in open('Rosalind_gc.txt'):
    if not line.startswith('>'):
        seq.append(line.strip())
    else:
        if seq:
            seq = ''.join(seq)
            perc = re.subn(r'[GC]', '', seq) / len(seq)
            print('{} has GC percent: {}'.format(name, perc * 100))
            seq = []
        name = line.strip()
knh190
  • 2,744
  • 1
  • 16
  • 30
  • Python's `str.count()` is implemented in C (for CPython), so why use a regex? – Chris_Rands May 31 '19 at 12:57
  • @Chris_Rands In this case `str.count()` is fast and concise but regex is more flexible (can match multiple patterns in one run). Yet there are also Biopython implementations fast enough so speaking of this question, it's just providing another option. – knh190 May 31 '19 at 14:25
  • @Chris_Rands Have you found any post answered your question? It has been already pretty much answered in my view. Yet I don't see any feedback. – knh190 May 31 '19 at 14:28
  • 1
    Regex doesn't work; matches "GC" exactly instead of matching characters "G" or "C" (patterns "[GC]" or "G|C" would work). Also the re.subn() returns a tuple that needs to be indexed. Looks good otherwise. – Ghoti Jun 03 '19 at 21:50
  • @Ghoti I misread the question. I thought was finding consequent GC. I'd fix the answer. – knh190 Jun 03 '19 at 23:08