I'm taking my first steps in learning Python and am currently working my way through the Rosalind online course, which aims to teach bioinformatics Python skills (very good, by the way; see rosalind.info).
I am struggling with one particular problem. I have a file in FASTA format, which looks like this:
>Sequence_Header_1
ACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGT
>Sequence_Header_2
ACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGT
I need to calculate the percentage of G and C in each entry of the file (excluding the headers) and return this number, for example:
>Sequence_Header_1
48.75%
>Sequence_Header_2
52.43%
My code so far is:
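file = open("input.txt", "r")
for line in file:
    line = line.rstrip()  # drop the trailing newline so it is not counted in len(line)
    if line.startswith(">"):
        print(line)
    elif line:  # skip blank lines to avoid dividing by zero
        # Parentheses make sure G and C are summed before dividing by the length
        print('%3.2f%%' % ((line.count('G') + line.count('C')) / len(line) * 100))
file.close()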
This does almost what I need. The trouble is where the sequence data spans multiple lines: at the moment I get the GC content for every line in the file rather than a single figure for each entry, for example:
>Sequence_Header_1
48.75%
52.65%
>Sequence_Header_2
52.43%
50.25%
How can I apply my formula to sequence data that spans multiple lines?
Thanks in advance,
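
One possible approach, as a minimal sketch rather than a definitive solution: collect the sequence lines that belong to a header, and only compute the percentage once the next header (or the end of the file) is reached. The function names `gc_percent` and `gc_content_by_record` are hypothetical, chosen for illustration, and the sketch assumes Python 3 and a well-formed FASTA file in which every sequence is preceded by a `>` header.

```python
import io


def gc_percent(sequence):
    """Return the percentage of G and C characters in a sequence string."""
    if not sequence:
        return 0.0  # assumption: an empty record is reported as 0% rather than raising
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence) * 100


def gc_content_by_record(lines):
    """Yield (header, gc_percentage) for each record in an iterable of FASTA lines."""
    header = None
    chunks = []  # sequence lines collected for the current record
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, gc_percent("".join(chunks))
            header = line
            chunks = []
        else:
            chunks.append(line)
    # The last record has no following header, so emit it after the loop
    if header is not None:
        yield header, gc_percent("".join(chunks))


# Demonstration on in-memory data shaped like the example in the question
example = io.StringIO(
    ">Sequence_Header_1\n"
    "ACGTACGTACGTACGTACGT\n"
    "ACGTACGTACGTACGTACGT\n"
    ">Sequence_Header_2\n"
    "ACGTACGTACGTACGTACGT\n"
    "ACGTACGTACGTACGTACGT\n"
)
for header, percent in gc_content_by_record(example):
    print(header)
    print("%3.2f%%" % percent)
```

Because the function only iterates over lines, an open file handle (for instance from `with open("input.txt") as handle:`) can be passed in place of the `io.StringIO` object used here.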