How to calculate one frequency matrix from an entire genome file?

Question

So, I'm simply trying to calculate single nucleotide frequencies(A, T, C, G) in a HUGE file that contains pattern similar to this: TTTGTATAAGAAAAAATAGG.

That would give me one line of output of the entire file such as:

The single nucleotide frequency matrix of T.volcanium Genome is: {'A': [234235], 'C': [234290], 'G': [32456], 'T': [346875]}

here is my code (without file path, open, close and main)

 def freq_dict_of_lists_v1(dna_list):
    n = max([len(dna) for dna in dna_list])
    frequency_matrix = {
        'A': [0] * n,
        'C': [0] * n,
        'G': [0] * n,
        'T': [0] * n,
    }
    for dna in dna_list:
        for index, base in enumerate(dna):
            frequency_matrix[base][index] += 1

    return frequency_matrix

for line in file:
    dna_list = file.readline().rstrip("\n")
    frequency_matrix = freq_dict_of_lists_v1(dna_list)
    print("The single nucleotide frequency matrix of T.volcanium Genome is: ")
    pprint.pprint(frequency_matrix)

And this is my output.

The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [21], 'C': [10], 'G': [11], 'T': [18]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [31], 'C': [6], 'G': [4], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [23], 'C': [9], 'G': [10], 'T': [18]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [17], 'C': [8], 'G': [9], 'T': [26]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [15], 'C': [13], 'G': [9], 'T': [23]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [21], 'C': [12], 'G': [10], 'T': [17]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [20], 'C': [9], 'G': [12], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [15], 'C': [15], 'G': [10], 'T': [20]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [20], 'C': [11], 'G': [10], 'T': [19]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [26], 'C': [13], 'G': [7], 'T': [14]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [12], 'C': [13], 'G': [13], 'T': [22]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [20], 'C': [16], 'G': [9], 'T': [15]}
The single nucleotide frequency matrix of T.volcanium Genome is: 
{'A': [22], 'C': [12], 'G': [6], 'T': [20]}

So it is computing it line per line. I've tried taking out the for loop, or taking off the readlines, but then it will only give me one line of output for only one line in the file. not the entire file.

I feel like I'm overthinking this. I'm sure there's a simple way to read the entire file and print one single line of output with the total frequencies... Any insight is appreciated.

Make sure you're actually iterating over each line. Print out `line` each time. — TankorSmash, Feb 02 '17 at 23:11

score 0 · Answer 1 · edited May 23 '17 at 11:53

I see two problems with your solution.

You are keeping track of bases per position, when in your question it says you want to keep track of counts across ALL lines
You are calling the function once per line.

My edits below should address. See the comments for explanation

def freq_dict_of_lists_v1(dna_list):
    frequency_matrix = {    # We are only keeping one variable per base
        'A': [0],           # so that we calculate counts across all lines
        'C': [0],
        'G': [0],
        'T': [0],
    }
    for dna in dna_list:
        for base in dna:   # No longer need index, so I removed enumerate
            frequency_matrix[base] += 1   # Change here since dict structure changed

    return frequency_matrix

# Unlike before, we are now appending all the lines into dna_list
for line in file:
    dna_list.append(file.readline().rstrip("\n"))

# Calling freq_dict_of_lists_v1 on ALL the lines at once (it is now out of loop)
frequency_matrix = freq_dict_of_lists_v1(dna_list)
print("The single nucleotide frequency matrix of T.volcanium Genome is: ")
pprint.pprint(frequency_matrix)

One caveat for this solution is to make sure that all bases in the file are Upper case. Also, make sure there are no non-ACGT characters (some sequences have special gap characters, etc.). If it is the case there there are other characters, you can refer to this thread where your default entry could be something like Gap.

parasu · Answer 2 · 2017-02-02T23:20:02.530

0

Not sure what HUGE means MB? GB?, but this is the simplest solution. However note that it loads the entire file into memory.

# open file with sequence
with open(path_to_file) as f:
    seq = f.read()

# count element A in sequence
seq.count('A')

edited Feb 02 '17 at 23:20

answered Feb 02 '17 at 23:07

parasu

933
7
13

How to calculate one frequency matrix from an entire genome file?

2 Answers2