Count GC content of fasta using python without error

Question

Human genome is made of 24 different chromosomes (actually 23 pairs= 46 chromosomes). this chromosomes are called 1, 2, 3, ..., 22, X and Y. Each chromosome is a very long string of 'G', 'C', 'A' and 'T' characters (for example chromosome 1 is made of almost 24 milion characters). I have each chromosome in a file (strand of chromosome 1 is in 1.fa file).

*.fa file is called fasta file which is an standard file for DNA strands information. this file has a structure like this:

>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA

As you see this kind of files contain a first line that give us some information about the source of GCAT string.

I wrote this code to count the GC content (ratio of number of G+C characters to all characters):

homo_sapiens_chromosomes_List=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, "x", "y"]
for i in homo_sapiens_chromosomes_List:
    i=str (i)
    file_open= open (i+".fa", "r") #Opening each file

    file_read= file_open.read() #Reading file

    file_read=file_read.upper() #Uppercase characters

    G=float(file_read.count("G"))   #Count G in file

    C=float(file_read.count("C"))   #Count C in file

    A=float(file_read.count("A"))   #Count A in file

    T=float(file_read.count("T"))   #Count A in file

    print "There are %d Gs, %d Cs, %d As and %d Ts, in the DNA strand, of chromosome number %s." % (G, C, A, T, i)

    print "GC content of this chromosome is:", (G+C)*100/(A+T+G+C), "percent"       #Prints GC Content

Now I have some questions:

How can I make this code more efficient (faster, shorter or...)
When I try to count GC content, the first line of fasta file which is not a part of DNA strand, is also counted. I wrote this function to delete this line before counting the GC content (this code comes after this line: file_read=file_read.upper()):

Code:

def Fasta_Clean(): #a function to delete the first line of fasta file
    global file_read
    if file_read.isalnum()==False:
        file_read=file_read[1:]
        Fasta_Clean()
    Fasta_Clean()

But this code returned: "RuntimeError: maximum recursion depth exceeded in cmp", so I wrote this one:

def Fasta_Clean(): #a function to delete the first line of fasta file
    global file_read
    fas=file_read[0:number]
    if fas.isalnum()==False:
        file_read=file_read[1:]
        Fasta_Clean()
    Fasta_Clean()

Now, when variable fas is more than fas=file_read[0:90], I see "RuntimeError: maximum recursion depth exceeded in cmp" again. How can I solve this problem?

If the fasta file is made of more than one strand and has an structure like this (in this example file is made of three different strands):

Example:

>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
GCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAATCATTGCACTCCAGCCTGGGCAAAAACA
CACCATTGCACTCCAGCCTGGGTAGCATAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTG

>gi|568815864|ref|NC_000009.14| Homo sapiens chromosome 8, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA

>gi|568815325|ref|NC_000009.14| Homo sapiens chromosome 9, GRCh38 Primary Assembly
CTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAACATTGCACTCCAGCAAAT
GTGAGCTGAGACTGCACCATTGCTAGCCAGGCATGGTGAAGTTGCAACTCCAGCCTGGGTAGCA

In this conditions how can I count GC content of each strand separately?

For your run time error you are calling the function within itself. While this is not always a problem in this case you are unconditionally looping into itself thus it will fall down a recursive rabbit hole — user2097159, Dec 03 '14 at 12:35
but "file_read=file_read[1:]" changes the string and function have an end condition. — Stack Over Hello, Dec 03 '14 at 12:37
*"function have an end condition"* - no it doesn't, you call recursively *whatever happens*. Note also that `if fas.isalnum()==False:` should just be `if not fas.isalnum():`, and that `global` is a sign you're doing something wrong. — jonrsharpe, Dec 03 '14 at 12:38
You would find this task easier if you started with a more logical structure. As I see it you need to loop over each line of the file, and either: 1. Start a new count (header line); 2. Add to current count (data line); or 3. Skip line (blank line). You will probably find `collections.Counter` useful. — jonrsharpe, Dec 03 '14 at 12:41

user2097159 · Accepted Answer · 2014-12-03T13:13:38.680

Here this should be just about all the code you need.

from collections import Counter
chrome_list=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
             12, 13, 14, 15, 16, 17, 18, 19, 20,
             21, 22, "x", "y"]
for i in chrome_list:
    file_ = open('{}.fa'.format(i), 'r')
    broken_file = file_.read().split('\n\n')
    for line in broken_file:
        print Counter(line.split('\n')[1])
    file_.close()

If you are using windows or mac you may need to change the \n\n

kezzos · Answer 2 · 2014-12-03T12:59:53.553

1

For point 1. you can count multiple occurrences at once using:

from collections import Counter
z = ['G', 'C', 'C', 'A', 'T']
Counter(z)
>>>Counter({'G': 1, 'C': 2, 'A': 1, 'T':1})

Could you do point 1 and 2 with a loop?:

d = {'G':0, 'C':0, 'A':0, 'T':0}
count = 0
for line in inputfile:
    if count == 0:
        count += 1
        continue
    d[line] += 1

edited Dec 03 '14 at 12:59

answered Dec 03 '14 at 12:42

kezzos

3,023
3
20
37

Also due to this being DNA that means that they are pairs. You would only have to do the count on one side not both. – user2097159 Dec 03 '14 at 12:43

Count GC content of fasta using python without error

2 Answers2