So, I'm simply trying to calculate single nucleotide frequencies(A, T, C, G) in a HUGE file that contains pattern similar to this: TTTGTATAAGAAAAAATAGG.
That would give me one line of output of the entire file such as:
The single nucleotide frequency matrix of T.volcanium Genome is: {'A': [234235], 'C': [234290], 'G': [32456], 'T': [346875]}
here is my code (without file path, open, close and main)
def freq_dict_of_lists_v1(dna_list):
n = max([len(dna) for dna in dna_list])
frequency_matrix = {
'A': [0] * n,
'C': [0] * n,
'G': [0] * n,
'T': [0] * n,
}
for dna in dna_list:
for index, base in enumerate(dna):
frequency_matrix[base][index] += 1
return frequency_matrix
for line in file:
dna_list = file.readline().rstrip("\n")
frequency_matrix = freq_dict_of_lists_v1(dna_list)
print("The single nucleotide frequency matrix of T.volcanium Genome is: ")
pprint.pprint(frequency_matrix)
And this is my output.
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [21], 'C': [10], 'G': [11], 'T': [18]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [31], 'C': [6], 'G': [4], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [23], 'C': [9], 'G': [10], 'T': [18]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [17], 'C': [8], 'G': [9], 'T': [26]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [15], 'C': [13], 'G': [9], 'T': [23]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [21], 'C': [12], 'G': [10], 'T': [17]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [20], 'C': [9], 'G': [12], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [15], 'C': [15], 'G': [10], 'T': [20]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [20], 'C': [11], 'G': [10], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [26], 'C': [13], 'G': [7], 'T': [14]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [12], 'C': [13], 'G': [13], 'T': [22]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [20], 'C': [16], 'G': [9], 'T': [15]}
The single nucleotide frequency matrix of T.volcanium Genome is:
{'A': [22], 'C': [12], 'G': [6], 'T': [20]}
So it is computing it line per line. I've tried taking out the for loop, or taking off the readlines, but then it will only give me one line of output for only one line in the file. not the entire file.
I feel like I'm overthinking this. I'm sure there's a simple way to read the entire file and print one single line of output with the total frequencies... Any insight is appreciated.