-1

Human genome is made of 24 different chromosomes (actually 23 pairs= 46 chromosomes). this chromosomes are called 1, 2, 3, ..., 22, X and Y. Each chromosome is a very long string of 'G', 'C', 'A' and 'T' characters (for example chromosome 1 is made of almost 24 milion characters). I have each chromosome in a file (strand of chromosome 1 is in 1.fa file).

*.fa file is called fasta file which is an standard file for DNA strands information. this file has a structure like this:

>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA

As you see this kind of files contain a first line that give us some information about the source of GCAT string.

I wrote this code to count the GC content (ratio of number of G+C characters to all characters):

homo_sapiens_chromosomes_List=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, "x", "y"]
for i in homo_sapiens_chromosomes_List:
    i=str (i)
    file_open= open (i+".fa", "r") #Opening each file

    file_read= file_open.read() #Reading file

    file_read=file_read.upper() #Uppercase characters

    G=float(file_read.count("G"))   #Count G in file

    C=float(file_read.count("C"))   #Count C in file

    A=float(file_read.count("A"))   #Count A in file

    T=float(file_read.count("T"))   #Count A in file

    print "There are %d Gs, %d Cs, %d As and %d Ts, in the DNA strand, of chromosome number %s." % (G, C, A, T, i)

    print "GC content of this chromosome is:", (G+C)*100/(A+T+G+C), "percent"       #Prints GC Content

Now I have some questions:

  1. How can I make this code more efficient (faster, shorter or...)

  2. When I try to count GC content, the first line of fasta file which is not a part of DNA strand, is also counted. I wrote this function to delete this line before counting the GC content (this code comes after this line: file_read=file_read.upper()):

Code:

def Fasta_Clean(): #a function to delete the first line of fasta file
    global file_read
    if file_read.isalnum()==False:
        file_read=file_read[1:]
        Fasta_Clean()
    Fasta_Clean()

But this code returned: "RuntimeError: maximum recursion depth exceeded in cmp", so I wrote this one:

def Fasta_Clean(): #a function to delete the first line of fasta file
    global file_read
    fas=file_read[0:number]
    if fas.isalnum()==False:
        file_read=file_read[1:]
        Fasta_Clean()
    Fasta_Clean()

Now, when variable fas is more than fas=file_read[0:90], I see "RuntimeError: maximum recursion depth exceeded in cmp" again. How can I solve this problem?

  1. If the fasta file is made of more than one strand and has an structure like this (in this example file is made of three different strands):

Example:

>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
GCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAATCATTGCACTCCAGCCTGGGCAAAAACA
CACCATTGCACTCCAGCCTGGGTAGCATAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTG

>gi|568815864|ref|NC_000009.14| Homo sapiens chromosome 8, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA

>gi|568815325|ref|NC_000009.14| Homo sapiens chromosome 9, GRCh38 Primary Assembly
CTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAACATTGCACTCCAGCAAAT
GTGAGCTGAGACTGCACCATTGCTAGCCAGGCATGGTGAAGTTGCAACTCCAGCCTGGGTAGCA

In this conditions how can I count GC content of each strand separately?

jonrsharpe
  • 115,751
  • 26
  • 228
  • 437
  • 1
    For your run time error you are calling the function within itself. While this is not always a problem in this case you are unconditionally looping into itself thus it will fall down a recursive rabbit hole – user2097159 Dec 03 '14 at 12:35
  • but "file_read=file_read[1:]" changes the string and function have an end condition. – Stack Over Hello Dec 03 '14 at 12:37
  • *"function have an end condition"* - no it doesn't, you call recursively *whatever happens*. Note also that `if fas.isalnum()==False:` should just be `if not fas.isalnum():`, and that `global` is a sign you're doing something wrong. – jonrsharpe Dec 03 '14 at 12:38
  • You would find this task easier if you started with a more logical structure. As I see it you need to loop over each line of the file, and either: 1. Start a new count (header line); 2. Add to current count (data line); or 3. Skip line (blank line). You will probably find `collections.Counter` useful. – jonrsharpe Dec 03 '14 at 12:41

2 Answers2

3

Here this should be just about all the code you need.

from collections import Counter
chrome_list=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
             12, 13, 14, 15, 16, 17, 18, 19, 20,
             21, 22, "x", "y"]
for i in chrome_list:
    file_ = open('{}.fa'.format(i), 'r')
    broken_file = file_.read().split('\n\n')
    for line in broken_file:
        print Counter(line.split('\n')[1])
    file_.close()

If you are using windows or mac you may need to change the \n\n

user2097159
  • 862
  • 5
  • 13
1

For point 1. you can count multiple occurrences at once using:

from collections import Counter
z = ['G', 'C', 'C', 'A', 'T']
Counter(z)
>>>Counter({'G': 1, 'C': 2, 'A': 1, 'T':1})

Could you do point 1 and 2 with a loop?:

d = {'G':0, 'C':0, 'A':0, 'T':0}
count = 0
for line in inputfile:
    if count == 0:
        count += 1
        continue
    d[line] += 1
kezzos
  • 3,023
  • 3
  • 20
  • 37
  • Also due to this being DNA that means that they are pairs. You would only have to do the count on one side not both. – user2097159 Dec 03 '14 at 12:43