Finding how many times a nucleotide appears in the same position

Question

I'm new to Python and I'm trying to solve a problem in which I am given a few DNA sequences, for example: sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC", "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

I want to know how many times the nucleotide "A" occurs at each position across all of those sequences (the answer should be 'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0] in that case). What I tried is:

A_pos = {"A":[sum(int(i[0]=="A") for i in sequences), sum(int(i[1]=="A") for i in sequences), sum(int(i[2]=="A") for i in sequences),

and so on for each position. I'm trying to make it check all the positions at once instead of writing out each position manually.

So what is your question? Is this method not working? If so, what result(s) are you getting instead? — MattDMo, Apr 12 '21 at 14:25
This method is working, I'm trying to make the code shorter by checking all the indices with a simpler line of code — ran bar, Apr 12 '21 at 14:28

Pietro · Answer 1 · 2021-04-12T14:33:25.400

The code you posted is only partial, but it iterates over sequences once per index. You can count everything in a single pass using zip (in the end each character still has to be read once, so my solution only changes the reading order):

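A = []
# zip(*sequences) yields one tuple per position, holding that
# position's character from every sequence
for s in zip(*sequences):
    print(s)  # show the column currently being counted
    num_a = 0
    for nuc in s:
        if nuc == "A":
            num_a += 1
    A.append(num_a)
print(A)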

The successive values of s are:

('G', 'T', 'C', 'A', 'T', 'T')
('A', 'C', 'A', 'C', 'A', 'A')
('G', 'C', 'G', 'A', 'G', 'G')

And so on, so you see that all the sequences are read one character at a time, and the result is:

[1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]
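Since each s coming out of zip is a tuple, the inner loop above could also be collapsed with the tuple's own count method. A minimal sketch, assuming all sequences have the same length; the name a_counts is only for illustration:

```python
sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

# zip(*sequences) yields one tuple per position;
# tuple.count tallies how many "A" characters that tuple holds
a_counts = [column.count("A") for column in zip(*sequences)]
print(a_counts)
```

This reads every character exactly once, same as the explicit loop, so it is a matter of taste rather than speed.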

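If the sequences are not all the same length, plain zip stops at the shortest one. You can use itertools.zip_longest instead, which pads the shorter sequences with a fill value (None by default, or any character you pass as fillvalue).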
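A rough sketch of the unequal-length case, using made-up short sequences (ragged is hypothetical data, not from the question) and "-" as an assumed padding character:

```python
from itertools import zip_longest

# Hypothetical input: the second and third sequences are shorter than the first
ragged = ["GATA", "AA", "TAC"]

# fillvalue="-" pads the shorter sequences; "-" never equals "A",
# so the padding is not counted
a_counts_ragged = [
    column.count("A") for column in zip_longest(*ragged, fillvalue="-")
]
print(a_counts_ragged)
```

Any fill value that is not a nucleotide letter would do; the default None works just as well for counting.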

Cheers!

score 0 · Accepted Answer · answered Apr 12 '21 at 14:29


You're close, but you need to loop over the index rather than writing out each lookup individually:

[sum(x[i] == "A" for x in sequences) for i in range(len(sequences[0]))]

answered Apr 12 '21 at 14:29

Sayse


nbrix · Answer 3 · 2021-04-12T15:15:34.000

This walks through all the sequences one position at a time and adds one to the matching nucleotide's count at that position.

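# one list of per-position counts for each nucleotide (13 is the sequence length)
result = {'A': 13*[0], 'G': 13*[0], 'T': 13*[0], 'C': 13*[0]}
# zip(*sequences) groups the characters found at the same index in every sequence
for index, sequence in enumerate(zip(*sequences)):
    for nucleotide in sequence:
        result[nucleotide][index] += 1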

Output:

{'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0], 'G': [1, 0, 4, 6, 0, 0, 1, 3, 1, 0, 0, 1, 2], 'T': [3, 0, 0, 0, 6, 1, 0, 2, 3, 3, 2, 3, 0], 'C': [1, 2, 1, 0, 0, 2, 1, 0, 1, 0, 4, 0, 4]}
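If you would rather not hard-code the length 13, one possible variation is to count each position with collections.Counter and then reshape. This is only a sketch, assuming equal-length sequences; the name columns is illustrative:

```python
from collections import Counter

sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

# One Counter per position: each maps nucleotide -> occurrences at that position
columns = [Counter(column) for column in zip(*sequences)]

# Reshape into {nucleotide: [count at position 0, count at position 1, ...]}.
# A Counter returns 0 for missing keys, so absent nucleotides need no special case.
result = {nuc: [c[nuc] for c in columns] for nuc in "AGTC"}
print(result["A"])
```

The number of positions then follows from the data itself, so the same code works for sequences of any (shared) length.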
