
I want to know whether it's possible to get a list of matches for a regex pattern that consists of a specific string repeated consecutively, e.g. "ADDD". This may sound trivial; in fact, according to regexpal.com, it should be as simple as "(AGATC)\1+" (image: result in regexpal.com). And re.findall, as stated in the documentation, should return a list of all those matches. However, when I use this code:

pattern = r"(AGATC)\\1+"
list_of_results = re.findall(pattern, seq_string)
print("list of results:", list_of_results)

where seq_string is the string I'm searching (the same one used in the regexpal image), I get a list with a single element containing only the pattern ('AGATC').

Is it possible to do what I need? Maybe I'm overlooking something?

Jorge Pasco
  • can you post the sample input and expected output – deadshot Jun 17 '20 at 07:13
  • @komatiraju033 yes! If you look at the photo, that’s the sample input, and the expected output is a list with the matching string that is highlighted, or strings if there’s more than one. I think it’s easier to understand by seeing the photo, rather than copy pasting the actual string here, since it’d be hard to see where the desired patterns are inside it. – Jorge Pasco Jun 17 '20 at 07:20
  • You need to define your regex either as `pattern = r"(AGATC)\1+"` or `pattern = "(AGATC)\\1+"`. Use `re.finditer` instead (`[x.group() for x in re.finditer(pattern, s)]`) (as explained [here](https://stackoverflow.com/a/31915134/3832970)). – Wiktor Stribiżew Jun 17 '20 at 09:13
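The quoting point in the last comment can be checked with a short sketch. The sample sequence below is made up for illustration and is not the asker's data; the variable names are also only illustrative.

```python
import re

# Made-up sample sequence (assumption, not the string from the question).
seq_string = "TTAGATCAGATCAGATCGG"

raw_single = r"(AGATC)\1+"    # raw string: \1 is a backreference to group 1
plain_double = "(AGATC)\\1+"  # normal string: \\ collapses to one backslash
raw_double = r"(AGATC)\\1+"   # raw string: regex sees an escaped (literal) backslash, then "1+"

# The first two spellings produce the same pattern text.
same_pattern = raw_single == plain_double

# finditer yields match objects, so group() gives the whole repeated run.
whole_matches = [m.group() for m in re.finditer(raw_single, seq_string)]

# The doubled backslash in a raw string looks for a literal "\" in the input.
literal_backslash_matches = re.findall(raw_double, seq_string)

print(same_pattern)               # True
print(whole_matches)              # ['AGATCAGATCAGATC']
print(literal_backslash_matches)  # []
```

With the doubled backslash inside a raw string, the pattern can only match input that contains an actual backslash character, which a sequence like this one never does.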

2 Answers


Try this:

import re

res = re.findall('AGATC', seq_string)
print(res)
deadshot

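Your issue is that when a regex contains one or more capture groups, re.findall returns only the contents of those groups rather than the whole match. With a single group it returns a list of strings (here just 'AGATC' for each run); with several groups it returns a list of tuples. You can work around this by using an outer group to capture the entire match, e.g.: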

pattern = r"((AGATC)\2+)"
list_of_results = re.findall(pattern, seq_string)
print("list of results:", list_of_results)

which will give you a result something like:

[('AGATCAGATCAGATC', 'AGATC')]
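To make the group behaviour concrete, here is a small self-contained sketch that runs re.findall with zero, one, and two capture groups. The sample sequence is invented for illustration (one run of three repeats and one run of two), not taken from the question.

```python
import re

# Invented sample: "AGATC" x3, a separator, then "AGATC" x2 (assumption).
seq_string = "TTAGATCAGATCAGATCGGAGATCAGATCTT"

# No capture groups: findall returns each whole match.
no_groups = re.findall(r"AGATC", seq_string)

# One capture group: findall returns only what group 1 captured.
one_group = re.findall(r"(AGATC)\1+", seq_string)

# Two capture groups: findall returns one tuple per match,
# (group 1 = whole run, group 2 = the repeated unit).
two_groups = re.findall(r"((AGATC)\2+)", seq_string)

print(no_groups)   # ['AGATC', 'AGATC', 'AGATC', 'AGATC', 'AGATC']
print(one_group)   # ['AGATC', 'AGATC']
print(two_groups)  # [('AGATCAGATCAGATC', 'AGATC'), ('AGATCAGATC', 'AGATC')]
```

Note that the backreference changes from \1 to \2 in the last pattern because the new outer group becomes group 1.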

You can use a list comprehension to keep only the first value (the whole run) from each result, e.g.:

list_of_results = [g[0] for g in re.findall(pattern, seq_string)]

to get something like:

['AGATCAGATCAGATC']

Or you can use re.finditer and build your list from the match objects it produces:

pattern = r"(AGATC)\1+"
list_of_results = [m.group() for m in re.finditer(pattern, seq_string)]
print("list of results:", list_of_results)

which will give you a result like:

['AGATCAGATCAGATC']
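Another option, offered here as a sketch rather than as part of the approaches above, is to avoid capture groups altogether. A non-capturing group with a {2,} quantifier matches the same "two or more consecutive repeats" as (AGATC)\1+, and since the pattern then has no capture groups, re.findall returns the whole matches directly. The sample sequence is again made up for illustration.

```python
import re

# Invented sample: a run of three repeats and a run of two (assumption).
seq_string = "TTAGATCAGATCAGATCGGAGATCAGATCTT"

# (?:...) groups without capturing; {2,} means "two or more times",
# which is equivalent to one copy followed by \1+ in the original pattern.
pattern = r"(?:AGATC){2,}"

# With no capture groups present, findall returns each full match as a string.
list_of_results = re.findall(pattern, seq_string)

print("list of results:", list_of_results)
# list of results: ['AGATCAGATCAGATC', 'AGATCAGATC']
```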
Nick