
I'm new to Python and I'm trying to solve a problem in which I'm given a few DNA sequences, for example: sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC", "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

I want to know how many times the nucleotide "A" occurs at each position across all of those sequences (in this case the answer should be 'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]). What I tried is:

A_pos = {"A": [sum(int(i[0]=="A") for i in sequences), sum(int(i[1]=="A") for i in sequences), sum(int(i[2]=="A") for i in sequences),

and so on for each position. I'm trying to make it check all the positions at once instead of writing out each position manually.

ran bar

3 Answers


The code you posted is only partial, but it iterates over sequences once per index. You can count everything in a single pass using zip. Every character still has to be read once, so this solution only changes the reading order:

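A = []
# zip(*sequences) yields one tuple per position, holding that
# position's character from every sequence
for s in zip(*sequences):
    print(s)  # shown only to illustrate what s contains
    num_a = 0
    for nuc in s:
        if nuc == "A":
            num_a += 1
    # one count per position
    A.append(num_a)
print(A)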

The successive values of s are:

('G', 'T', 'C', 'A', 'T', 'T')
('A', 'C', 'A', 'C', 'A', 'A')
('G', 'C', 'G', 'A', 'G', 'G')

And so on. Each iteration takes one character from every sequence, and the result is:

[1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]
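The same loop can be written more compactly. This is only a sketch of an alternative, not a change to the approach: each value produced by zip is a tuple, and tuple.count does the inner counting. The name a_counts is chosen here for illustration.

```python
sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

# zip(*sequences) yields one tuple per position (column);
# tuple.count tallies the "A"s in that column.
a_counts = [column.count("A") for column in zip(*sequences)]
print(a_counts)  # [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]
```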

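If the sequences are not all the same length, plain zip stops at the shortest one. In that case you can use itertools.zip_longest, which pads the shorter sequences with a fill value (None by default, or whatever you pass as fillvalue).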
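A minimal sketch of the zip_longest variant, assuming a made-up list named ragged with unequal lengths (it is not from the question) and "-" as the padding character:

```python
from itertools import zip_longest

# Hypothetical input with unequal lengths, for illustration only.
ragged = ["GATA", "AA", "TAC"]

# fillvalue pads sequences that have run out; "-" never equals "A",
# so padded positions add nothing to the count.
a_counts = [column.count("A")
            for column in zip_longest(*ragged, fillvalue="-")]
print(a_counts)  # [1, 3, 0, 1]
```

The result has one entry per position of the longest sequence.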

Cheers!

Pietro

You're close, but instead of writing out each lookup by hand, loop over the index:

[sum(x[i] == "A" for x in sequences) for i in range(len(sequences[0]))]
Sayse

This walks through all the sequences one position at a time and adds one to the matching nucleotide's count at that index.

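# 13 is the length of every sequence in the example
result = {'A': 13*[0], 'G': 13*[0], 'T': 13*[0], 'C': 13*[0]}
# zip(*sequences) yields the characters found at each position
for index, sequence in enumerate(zip(*sequences)):
    for nucleotide in sequence:
        result[nucleotide][index] += 1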

Output:

{'A': [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0], 'G': [1, 0, 4, 6, 0, 0, 1, 3, 1, 0, 0, 1, 2], 'T': [3, 0, 0, 0, 6, 1, 0, 2, 3, 3, 2, 3, 0], 'C': [1, 2, 1, 0, 0, 2, 1, 0, 1, 0, 4, 0, 4]}
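To avoid hardcoding the length 13, the lists can be sized from the data. The following is a sketch under two assumptions: all sequences have the same length, and they contain only the letters A, G, T and C (any other character, such as N, would raise a KeyError). The names length and counts are chosen here for illustration.

```python
sequences = ["GAGGTAAACTCTG", "TCCGTAAGTTTTC", "CAGGTTGGAACTC",
             "ACAGTCAGTTCAC", "TAGGTCATTACAG", "TAGGTACTGATGC"]

# Assumes every sequence is as long as the first one.
length = len(sequences[0])

# One independent list of zeros per nucleotide.
counts = {nucleotide: [0] * length for nucleotide in "AGTC"}

for index, column in enumerate(zip(*sequences)):
    for nucleotide in column:
        counts[nucleotide][index] += 1

print(counts["A"])  # [1, 4, 1, 0, 0, 3, 4, 1, 1, 3, 0, 2, 0]
```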
nbrix