Adjusting the function to find the location of more than one base

Question

I created this function, and it finds the locations of a base in a DNA sequence, like dna = ['A', 'G', 'C', 'G', 'T', 'A', 'G', 'T', 'C', 'G', 'A', 'T', 'C', 'A', 'A', 'T', 'T', 'A', 'T', 'A', 'C', 'G', 'A', 'T', 'C', 'G', 'G', 'G', 'T', 'A', 'T']. I need it to find more than one base at a time, like 'A''T'. Can anyone help?

def position(list, value):
    pos = []
    for n in range(len(list)):
        if list[n] == value:
            pos.append(n)
    return pos

score 1 · Answer 1 · answered Jul 08 '22 at 17:25

You can work with the DNA sequence as a string and then use a regex:

import re

dna_str = ''.join(dna)

pattern = r'AT'

pos = [(i.start(0), i.end(0)) for i in re.finditer(pattern, dna_str)]
print(pos)

[(10, 12), (14, 16), (17, 19), (22, 24), (29, 31)]
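
One caveat worth noting: re.finditer returns non-overlapping matches. That makes no difference for 'AT', but a motif that can overlap itself, such as 'AA' inside 'AAA', would lose hits. A minimal sketch of one workaround, assuming overlapping hits are wanted, wraps the motif in a zero-width lookahead. The function name overlapping_positions is made up for illustration:

```python
import re

def overlapping_positions(dna, motif):
    """Return the start index of every occurrence of motif, overlaps included."""
    dna_str = ''.join(dna)  # accepts a list of bases or a plain string
    # A lookahead matches without consuming characters, so the scan
    # advances one position at a time and overlapping hits are kept.
    pattern = '(?=' + re.escape(motif) + ')'
    return [m.start() for m in re.finditer(pattern, dna_str)]

print(overlapping_positions(list('AAAT'), 'AA'))  # [0, 1]
```

With a plain r'AA' pattern the same input would give only the match starting at 0.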

answered Jul 08 '22 at 17:25

Ignatius Reilly


score 0 · Answer 2 · answered Jul 08 '22 at 17:21

Side note: it's good not to use keywords for variable names. list is a Python keyword.

def position(l: list, values: list) -> list:
    pos = []
    # record the index of every element that is one of the wanted values
    for i, val in enumerate(l):
        if val in values:
            pos.append(i)
    return pos
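
For illustration, here is a sketch of the same "any of these bases" idea with names that avoid both list and l. The name positions_of_any and the use of a set are assumptions of this sketch, not part of the answer above:

```python
def positions_of_any(sequence, bases):
    """Return the indices of every element of sequence that is one of bases."""
    wanted = set(bases)  # set membership tests are fast
    return [i for i, base in enumerate(sequence) if base in wanted]

print(positions_of_any(list('AGCTTA'), ['A', 'T']))  # [0, 3, 4, 5]
```

Note that this reports every 'A' and every 'T' individually; it does not look for 'A' followed by 'T'.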

answered Jul 08 '22 at 17:21

smcrowley


You *can't* use [keywords](https://docs.python.org/3/reference/lexical_analysis.html#keywords) as variable names, and `list` isn't one, and `l` is explicitly discouraged by PEP 8. – Kelly Bundy Jul 08 '22 at 17:40

score 0 · Answer 3 · answered Jul 08 '22 at 17:23

You should definitely use Python built-in functions. For instance, instead of position(list, value) you could use a list comprehension:

[n for n,x in enumerate(dna) if x == 'A']

Finding a bigram could be reduced to the above if you consider pairs of letters:

[n for n,x in enumerate(zip(dna[:-1], dna[1:])) if x==('A','T')]

If instead you want to find the positions of either 'A' or 'T', you could just specify that as the condition

[n for n,x in enumerate(dna) if x in ('A', 'T')]
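
The pair idea extends to a motif of any length by comparing slices of the list instead of zipped pairs. This is only a sketch: motif_positions is a hypothetical name, and it assumes dna is a list of single-character strings as in the question.

```python
def motif_positions(dna, motif):
    """Return every index at which the bases in motif occur consecutively in dna."""
    target = list(motif)  # 'AT' -> ['A', 'T']
    k = len(target)
    # Compare each length-k window of the list with the target.
    return [n for n in range(len(dna) - k + 1) if dna[n:n + k] == target]

dna = list('AGCGTAGTCGATCAATTATACGATCGGGTAT')  # the sequence from the question
print(motif_positions(dna, 'AT'))   # [10, 14, 17, 22, 29]
print(motif_positions(dna, 'TAT'))  # [16, 28]
```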

Joffan · Answer 4 · 2022-07-08T20:10:42.323

Python's str.find will efficiently find a substring of a string, starting the search from any given index.

def positions(dnalist, substr):
    dna = "".join(dnalist) # make single string
    st = 0
    pos = []
    while True: 
        a_pos = dna.find(substr, st)
        if a_pos < 0:
            return pos
        pos.append(a_pos)
        st = a_pos + 1

Test usage:

>>> testdna = ['A', 'G', 'C', 'G', 'T', 'A', 'G', 'T', 'C', 'G', 'A', 'T', 'C', 'A', 'A', 'T', 'T', 'A', 'T', 'A', 'C', 'G', 'A', 'T', 'C', 'G', 'G', 'G', 'T', 'A', 'T']
>>> positions(testdna, "AT")
[10, 14, 17, 22, 29]
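
The st = a_pos + 1 step is what lets overlapping occurrences through. If non-overlapping matches are wanted instead, one option is to advance past the whole match. The following is a sketch under that assumption; positions_non_overlapping is a hypothetical name, and the empty-substring check is added here to avoid an endless loop:

```python
def positions_non_overlapping(dnalist, substr):
    """Like positions(), but each match must start after the previous one ends."""
    if not substr:
        raise ValueError("substr must not be empty")
    dna = "".join(dnalist)  # make single string
    st = 0
    pos = []
    while True:
        a_pos = dna.find(substr, st)
        if a_pos < 0:
            return pos
        pos.append(a_pos)
        st = a_pos + len(substr)  # resume the search after the match

print(positions_non_overlapping(list("AAAA"), "AA"))  # [0, 2]
```

For "AT" in the test sequence both variants give the same result, since "AT" cannot overlap itself.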

edited Jul 08 '22 at 20:10

answered Jul 08 '22 at 17:29

Joffan


This will return only the first location of each base. Not sure if all positions are wanted, but that's what their method does for a single base, so it should do the same for many bases. – smcrowley Jul 08 '22 at 17:39
@smcrowley you are mistaken - this code returns all positions, including overlapping positions. I added a sample run of the code in interactive mode – Joffan Jul 08 '22 at 20:09
That's my bad; you're correct about it not stopping after the first find. But you're finding the position of "A" only when it's followed by "T", not getting the indices of all "A"s and "T"s. – smcrowley Jul 08 '22 at 20:25
Absolutely true; that is my reading of the requirement. – Joffan Jul 08 '22 at 20:25
