Python GC Counter - Rosalind

Question

I'm attempting to write a programme that will calculate the GC content in each of a series of sequences (input in fasta format) and then return the name of the sequence with the highest percentage and its GC percentage. As per this Rosalind problem.

I've finally stopped getting error messages but my code doesn't appear to do anything. Does anyone have any idea why that might be?

#Define functions
#Calculate GC percentage 
def Percent(sequence):
G_count = sequence.count ('G')
C_count = sequence.count ('C')
Total_count = len(sequence)
GC_Sum = int(G_count) + int(C_count)
Percent_GC = GC_Sum / Total_count
Per_GC = (Percent_GC)*100
return Per_GC

Input = input ("Input Sequence")

#Fasta file into dictionary
fasta_dictionary = {}
sequence_name = ""
for line in Input:
    line = line.strip()
    if not line:
        continue
    if line.startswith(">"):
        sequence_name = line[1:]
        if sequence_name not in fasta_dictionary:
            fasta_dictionary[sequence_name] = []
        continue
    sequence = line
    fasta_dictionary[sequence_name].append(sequence)

#Put GC values for each sequence into dictionary
dictionary = dict()
for sequence_name in fasta_dictionary:
dictionary[sequence_name] = float(Percent(sequence))

#Find highest
for sequence_name, sequence in fasta_dictionary.items():
    inverse = [(sequence, sequence_name) for sequence_name, sequence in dictionary.items()]
    highest_GC = max(inverse)[1]  

#Find sequence name
for sequence_name, sequence in fasta_dictionary.items():
        if sequence == highest_GC:
            print ((sequence_name) + ' ' + (highest_GC))

You don't seem to open your data file. From what I understand, `Input` is the _name_ of the data file, but not the data file itself. Try writing `with open(Input, 'r') as :` and then: `for line in :`. See what happens. (Also, if you're using Python 2.x, be careful of `GC_Sum / Total_count` because it involves _integer_ division.) — Pier Paolo, Jan 31 '16 at 19:17
Also, please properly indent you python code. Indentation is syntactically important. — pppery, Jan 31 '16 at 19:52
I lost some of the indents when I copied it onto here. They're right on the real file. — Carey, Feb 02 '16 at 09:06

score 1 · Answer 1 · answered Jan 31 '16 at 22:17

So, Pier Paolo is correct change the first line to with open() and indent the rest of your code under that as shown below.

with open('/path/to/your/fasta.fasta', 'r') as Input:
   fasta_dictionary = {}

He is also correct on the division -- this should help inside your Percent function. Percent_GC = float(GC_Sum) / Total_count

Instead of appending, just assign sequence as a string.

sequence = line
fasta_dictionary[sequence_name] = sequence

Next, when you are calling your Percent function, you are passing sequence, after you've exited the for loop where you define each sequence iteratively. You have them stored in a dictionary called fasta_dictionary, so change this piece of code.

for sequence_name in fasta_dictionary:
        dictionary[sequence_name] = float(Percent(fasta_dictionary[sequence_name]))

Finally, at the end, you are checking if sequence == highest_GC:. This is what you're currently checking:

for sequence_name, sequence in fasta_dictionary.items():
            print sequence

prints a str of the actual sequence data.

'ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC'

and highest_GC is the "name" of the sequence

>sequence1

change that to read if sequence_name == highest_GC

Running your code with the changes above always printed the name of the sequence with the highest GC content %. There are a lot of other unnecessary steps and repetitive code, but hopefully this gets you started. Good Luck!

score 0 · Answer 2 · answered Jun 10 '16 at 11:59

Another solution for your GC problem would be to use the Counter higher-order data structure in python. It can automatically set and count your nucleotides for you such that you can just directly ask for the numbers to calculate as follows:

from collections import Counter

#set a var to hold your dna
myDna = ''
#open your Dna fasta
with open('myFasta', 'r') as data:
     for line in data:
          if '>' in line:
               continue
          myDna += line.strip()

#Now count your dna
myNucleotideCounts = Counter(myDna)

#calculate GC content
myGC = (myNucleotideCounts['G'] + myNucleotideCounts['C']) / float(len(myDna))

print('Dna GC Content = {0}'.format(myGC))

Python GC Counter - Rosalind

2 Answers2