Finding regular expression with at least one repetition of each letter

Question

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.

For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each letter).

I have no clue how to do that in Python 2.7. I tried regular expressions, but they did not find every variant.

How can I achieve that?

Hi! Could you please include the code you tried that failed, as well as reformat your question to more explicitly state the inputs, expected outputs, and approach? This will make it easier for us to help :) — NBlaine, Apr 25 '18 at 19:23
`re.findall(r'(A+?C+?T+?G+?)',seqs)` The input is [this](https://www.ncbi.nlm.nih.gov/nuccore/NC_000012.12?report=fasta&from=69348354&to=69354233). I have to search for 'words' in a DNA sequence, and each word must contain at least one A, C, T and G. A word 'ends' when all of those letters are in it. That's all I know — Michał Kowalski, Apr 25 '18 at 19:30
Even if you do go with regular expressions, you would need overlapping. — user3483203, Apr 25 '18 at 19:36
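
The overlapping-match idea from the comment above can be sketched with zero-width lookaheads. This is only an illustration, not the approach any answer below uses: the names `PATTERN` and `shortest_windows_regex` are made up here, and the sequence is assumed to be a single line with no newlines. Each lookahead lazily captures the shortest prefix that ends in one letter, and the longest of the four captures is the shortest word starting at that position.

```python
import re

# Four zero-width lookaheads: each captures the shortest prefix (from the
# current position) that ends in A, C, G or T respectively. The overall
# match is empty, so finditer tries every start position in turn.
PATTERN = re.compile(r'(?=(.*?A))(?=(.*?C))(?=(.*?G))(?=(.*?T))')

def shortest_windows_regex(seq):
    # The longest of the four captures is the shortest prefix that
    # contains all four letters.
    return [max(m.groups(), key=len) for m in PATTERN.finditer(seq)]

print(shortest_windows_regex('AAGTCCTAG'))
```

On long sequences this rescans the string from every start position, so the loop-based answers are likely to be faster.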

user3483203 · Accepted Answer · 2018-04-25T19:42:10.393

You could find all substrings of length 4 or more, and then filter those down to the shortest ones that contain at least one of each letter (a substring is kept only if dropping its last character would lose a letter):

s = 'AAGTCCTAG'

def get_shortest(s):
  l, b = len(s), set('ATCG')
  options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
  return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]

print(get_shortest(s))

Output:

['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
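
To run this on the FASTA input mentioned in the question, the header line and the line breaks have to be removed first. A minimal sketch, assuming a single-record file; `read_fasta_sequence` is a hypothetical helper, not part of any library, and the example record is made up:

```python
def read_fasta_sequence(lines):
    """Join the sequence lines of single-record FASTA input, skipping '>' headers."""
    parts = []
    for line in lines:
        line = line.strip()
        # Header lines start with '>'; blank lines carry no sequence data.
        if line and not line.startswith('>'):
            parts.append(line.upper())
    return ''.join(parts)

# Made-up record, split over two lines the way FASTA files wrap sequences.
example = ['>demo record', 'AAGTC', 'CTAG']
print(read_fasta_sequence(example))
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.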

score 0 · Answer 2 · answered Apr 25 '18 at 19:41

This is another way to do it. It may not be as fast or as neat as chrisz's answer, but it may be a little simpler for beginners to read and understand.

DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
    letters=['A','G','T','C']
    j=i
    seq=[]
    while len(letters)>0 and j<(len(DNA)):
        seq.append(DNA[j])
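        try:
            letters.remove(DNA[j])
        except ValueError:
            # DNA[j] was already seen in this window
            pass
        j+=1
    if len(letters)==0:
        # store the window as a string rather than a list of characters
        toSave.append(''.join(seq))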

print(toSave)

score 0 · Answer 3 · answered Apr 25 '18 at 19:41

Since the substring you are looking for can be of almost any length, a FIFO queue (a sliding window) works well. Append one letter at a time and check whether the queue contains at least one of each letter. If it does, yield it, then remove letters from the front and keep checking until the window is no longer valid.

def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield("".join(cur_str))
            cur_str.pop(0)

seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)

That seems to result in the substrings you are looking for:

AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
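
The same queue idea can be sketched with `collections.deque` and a `Counter`, which avoids the repeated `count` calls and the `pop(0)` on a list. This is a variation for illustration only, not the code from the answer; `find_windows` and its `alphabet` parameter are names chosen here.

```python
from collections import Counter, deque

def find_windows(seq, alphabet='ACGT'):
    """Yield each window that contains every letter, shrinking from the left."""
    window = deque()
    counts = Counter()
    for ch in seq:
        window.append(ch)
        counts[ch] += 1
        # While every letter is present, emit the window and drop its first letter.
        while all(counts[c] > 0 for c in alphabet):
            yield ''.join(window)
            counts[window.popleft()] -= 1

print(list(find_windows('AAGTCCTAG')))
```

Each letter is appended once and removed at most once, so the bookkeeping stays cheap even on long sequences; joining the window into a string is the remaining cost per result.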

ctwheels · Answer 4 · 2018-04-26T15:18:24.210

I really wanted to create a short answer for this, so this is what I came up with!

s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)

while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)

It works as follows:

c is initialised to the length of d (d = ACGT), the string of characters that must all appear in a substring.
The while loop examines the prefix s[:c] of s for as long as c is no greater than the length of s.
- c increases by 1 on each iteration of the while loop.
- If every character in d (ACGT) exists in the prefix, we print it, reset c to its default value and drop the first character of s.
- The loop continues until c exceeds the length of s.

Result:

AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG

To get the output in a list instead:

s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]

while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)

print(r)

Result:

['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']

score -1 · Answer 5 · answered Apr 25 '18 at 19:25

If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.

from itertools import groupby
import numpy as np

def find_repeats(input_list, n_repeats):
    flagged_items = []

    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))

        # Create list of tuples: (character, number of consecutive repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]

        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])   

        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)

    # Return flagged items
    return flagged_items

#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']

find_repeats(test_list, n_repeats=2)  # Returns ['aatcg', 'ctagg']

Hm, it's correct on my laptop when I run it. Could you share your output? — Matt Sosna, Apr 25 '18 at 19:32
Ah, I see. I misunderstood OP's question. Thanks for the catch. This function will just return items in a list that have any repeated characters; it won't break the string in the way OP was looking for. — Matt Sosna, Apr 25 '18 at 19:37
