-1

I created this function, which finds the locations of a base in a DNA sequence, e.g. dna = ['A', 'G', 'C', 'G', 'T', 'A', 'G', 'T', 'C', 'G', 'A', 'T', 'C', 'A', 'A', 'T', 'T', 'A', 'T', 'A', 'C', 'G', 'A', 'T', 'C', 'G', 'G', 'G', 'T', 'A', 'T']. I need it to find more than one base at a time, like 'A''T'. Can anyone help?

def position(list, value):
     pos = []
     for n in range(len(list)):
             if list[n] == value:
                     pos.append(n)
     return pos
Raymond Chen

4 Answers

1

You can work with the dna sequence as a string, and then use regex:

import re

dna_str = ''.join(dna)

pattern = r'AT'

pos = [(i.start(0), i.end(0)) for i in re.finditer(pattern, dna_str)]
print(pos)

[(10, 12), (14, 16), (17, 19), (22, 24), (29, 31)]
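
Note that re.finditer reports non-overlapping matches only. That makes no difference for 'AT', but it would for a motif such as 'AA'. A minimal sketch, assuming overlapping hits should be counted too, wraps the pattern in a zero-width lookahead; the helper name overlapping_positions is made up for illustration and is not part of the answer above:

```python
import re

def overlapping_positions(dna, motif):
    """Return the start index of every occurrence of motif, overlaps included."""
    dna_str = ''.join(dna)
    # (?=...) matches without consuming characters, so the scan advances one
    # position at a time; re.escape keeps the motif literal.
    pattern = '(?=' + re.escape(motif) + ')'
    return [m.start() for m in re.finditer(pattern, dna_str)]

print([m.start() for m in re.finditer('AA', 'AAAA')])  # [0, 2]
print(overlapping_positions(list('AAAA'), 'AA'))       # [0, 1, 2]
```

This returns start indices only; the end of each match is simply start + len(motif).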
Ignatius Reilly
0

Side note: it's good not to use keywords for variable names. list is a Python keyword.

def position(l: list, values: list) -> list:
     pos = []
     for i, val in enumerate(l):
             if val in values:
                     pos.append(i)
     return pos
smcrowley
  • You *can't* use [keywords](https://docs.python.org/3/reference/lexical_analysis.html#keywords) as variable names, and `list` isn't one, and `l` is explicitly discouraged by PEP 8. – Kelly Bundy Jul 08 '22 at 17:40
0

You should definitely use Python built-in functions. For instance, instead of position(list, value) you could use a list comprehension:

[n for n,x in enumerate(dna) if x == 'A']

Finding a bigram could be reduced to the above if you consider pairs of letters:

[n for n,x in enumerate(zip(dna[:-1], dna[1:])) if x==('A','T')]

If instead you want to find the positions of either 'A' or 'T', you could just specify that as the condition

[n for n,x in enumerate(dna) if x in ('A', 'T')]
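
The pairwise zip extends to a motif of any length by comparing slices of the list against the motif. A sketch under that assumption (motif_positions is a hypothetical name, not something from the answer above):

```python
def motif_positions(dna, motif):
    """Return every index at which the list dna contains motif as a contiguous run."""
    motif = list(motif)  # accept 'AT' as well as ['A', 'T']
    k = len(motif)
    # Compare each length-k window of the list with the motif.
    return [i for i in range(len(dna) - k + 1) if dna[i:i + k] == motif]

dna = list('AGCGTAGTCGATCAATTATACGATCGGGTAT')  # same sequence as in the question
print(motif_positions(dna, 'AT'))   # [10, 14, 17, 22, 29]
print(motif_positions(dna, 'TAT'))  # [16, 28]
```

Each window is sliced separately, so this is simple rather than fast; for long sequences, joining the list into a string and searching that is likely the better choice.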
Dima Chubarov
0

Python will efficiently find a substring of a string starting from any point.

def positions(dnalist, substr):
    dna = "".join(dnalist) # make single string
    st = 0
    pos = []
    while True: 
        a_pos = dna.find(substr, st)
        if a_pos < 0:
            return pos
        pos.append(a_pos)
        st = a_pos + 1

Test usage:

>>> testdna = ['A', 'G', 'C', 'G', 'T', 'A', 'G', 'T', 'C', 'G', 'A', 'T', 'C', 'A', 'A', 'T', 'T', 'A', 'T', 'A', 'C', 'G', 'A', 'T', 'C', 'G', 'G', 'G', 'T', 'A', 'T']
>>> positions(testdna, "AT")
[10, 14, 17, 22, 29]
Joffan
  • this will return only the first location of each base. not sure if all positions are wanted but that's what their method does for a single base so it should do the same for many bases – smcrowley Jul 08 '22 at 17:39
  • @smcrowley you are mistaken - this code returns all positions, including overlapping positions. I added a sample run of the code in interactive mode – Joffan Jul 08 '22 at 20:09
  • that's my bad, you're correct about it not stopping after the first find. but you're finding the position of "A" only when it's followed by "T", not getting the indices of all "A"s and "T"s – smcrowley Jul 08 '22 at 20:25
  • Absolutely true; that is my reading of the requirement. – Joffan Jul 08 '22 at 20:25