I am writing a function that is supposed to go through a .fasta file of DNA sequences and create a dictionary of nucleotide (nt) and dinucleotide (dnt) frequencies for each sequence in the file. I am then storing each dictionary in a list called "frequency". This is the piece of code that is acting strange:
for fasta in seq_file:
freq = {}
dna = str(fasta.seq)
for base1 in ['A', 'T', 'G', 'C']:
onefreq = float(dna.count(base1)) / len(dna)
freq[base1] = onefreq
for base2 in ['A', 'T', 'G', 'C']:
dinucleotide = base1 + base2
twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
freq[dinucleotide] = twofreq
frequency.append(freq)
(I am using biopython, by the way, so that I do not have to commit the entire fasta file to memory. Also this is for ssDNA, so I don't have to account for anti-sense dnt)
The frequencies that are recorded for the single nt add to 1.0, but the frequencies for the dnt do not add to 1.0. Which is od since the method of calculating the two types of frequencies are identical in my eyes.
I left the diagnostic print-statements and the "check" variables in:
for fasta in seq_file:
freq = {}
dna = str(fasta.seq)
check = 0.0
check2 = 0.0
for base1 in ['A', 'T', 'G', 'C']:
onefreq = float(dna.count(base1)) / len(dna)
freq[base1] = onefreq
check2 += onefreq
for base2 in ['A', 'T', 'G', 'C']:
dinucleotide = base1 + base2
twofreq = float(dna.count(dinucleotide)) / (len(dna) - 1)
check += twofreq
print(twofreq)
freq[dinucleotide] = twofreq
print("\n")
print(check, check2)
print(len(dna))
print("\n")
frequency.append(freq)
to get this output: (for only one sequence)
0.0894168466523
0.0760259179266
0.0946004319654
0.0561555075594
0.0431965442765
0.0423326133909
0.0747300215983
0.0488120950324
0.0976241900648
0.0483801295896
0.0539956803456
0.0423326133909
0.0863930885529
0.0419006479482
0.0190064794816
0.031101511879
(0.9460043196544274, 1.0)
2316
Here we can see the frequency of each of the 16 different dnt possible, the sum of all dnt frequencies (0.946) and the sum of all nt frequencies (1.0) and the length of the sequence.
Why does the dnt frequency not add up to 1.0?
Thanks for your help. I'm very new to python, and this is my first question, so I hope that this submissions is acceptable.