
I'm looking for a regular expression that will identify a sequence in which an integer in the text specifies the number of trailing letters at the end of the expression. This specific example applies to identifying insertions and deletions in genetic data in the pileup format.

For example:

If the text I am searching is:

AtT+3ACGTTT-1AaTTa

I need to match the insertions and deletions, which in this case are +3ACG and -1A. The integer (n) portion can be any integer of 1 or larger, and I must capture the n trailing characters.

I can match a single insertion or deletion with [+-]?[0-9]+[ACGTNacgtn], but I can't figure out how to grab the exact number of trailing ACGTN's specified by the integer.

I apologize if there is an obvious answer here; I have been searching for hours. Thanks!

(UPDATE)

I typically work in Python. The one workaround I've been able to figure out with Python's re module is to collect both the integer and the span of every indel, then combine the two to extract the appropriate length of text.

For example:

>>> import re
>>> a = 'ATTAA$At^&atAA-1A+1G+4ATCG'
>>> expr = '[+-]?([0-9]+)[ACGTNacgtn]'
>>> ints = re.findall(expr, a)  # returns a list of the integers (as strings)
>>> spans = [i.span() for i in re.finditer(expr, a)]
>>> newspans = [(spans[i][0], spans[i][1] + (int(ints[i]) - 1)) for i in range(len(spans))]
>>> newspans
[(14, 17), (17, 20), (20, 26)]

The resulting tuples allow me to slice out the indels. Probably not the best syntax, but it works!

Aaron Sams
    That is impossible with regular expressions. Certain implementations of "regular" expressions allow this, but it will be more difficult and slower than performing the calculations outside of the expression. –  Jul 28 '12 at 04:52

3 Answers


You can use regular expression substitution and pass a function as the replacement. For example:

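import re

s = "abcde+3fghijkl-1mnopqr+12abcdefghijklmnoprstuvwxyz"

def dump(match):
    # span() covers only the sign and the digits, e.g. "+3"
    start, end = match.span()
    # extend the slice by the integer that follows the sign
    print(s[start:end + int(s[start+1:end])])
    # return the matched text so the substitution leaves it unchanged
    return match.group(0)

re.sub(r'[-+]\d+', dump, s)

#output
# +3fgh
# -1m
# +12abcdefghijkl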
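If the goal is to collect the indels rather than print them, the same slicing idea can be sketched with re.finditer. This is only a sketch: the function name find_indels is made up for illustration, and it assumes every indel starts with an explicit + or - and that the counted characters really do follow the integer.

```python
import re

def find_indels(pileup):
    """Return (start_index, indel_text) pairs such as (3, '+3ACG').

    Hypothetical helper: the regex finds only the sign and the integer;
    the counted characters are taken by slicing the original string.
    """
    indels = []
    for m in re.finditer(r'[+-](\d+)', pileup):
        n = int(m.group(1))            # how many characters to take
        end = m.end() + n              # one past the last counted character
        indels.append((m.start(), pileup[m.start():end]))
    return indels

print(find_indels('AtT+3ACGTTT-1AaTTa'))
# [(3, '+3ACG'), (11, '-1A')]
```

Note that the slice does not check that the counted characters are bases; it simply takes the next n characters.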
6502

It's not directly possible; regexes can't 'count' like that.

But if you're using a programming language that allows a callback as the regex match evaluator (e.g. C#, PHP), you could use the regex [+-]?([0-9]+)([ACGTNacgtn]+) and, in the callback, trim the trailing characters to the desired length.

For example, in C#:

var regexMatches = new List<string>();
Regex theRegex = new Regex(@"[+-]?([0-9]+)([ACGTNacgtn]+)");
text = theRegex.Replace(text, delegate(Match thisMatch)
{
    int numberOfInsertsOrDeletes = Convert.ToInt32(thisMatch.Groups[1].Value);
    string trailingString = thisMatch.Groups[2].Value;
    // The regex grabs every trailing base, so trim down to the stated count.
    if (numberOfInsertsOrDeletes < trailingString.Length)
    { trailingString = trailingString.Substring(0, numberOfInsertsOrDeletes); }
    regexMatches.Add(trailingString);

    // Return the whole match so the text itself is left unchanged.
    return thisMatch.Groups[0].Value;
});
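Python's re.sub also accepts a function as the replacement, so the same callback approach might look roughly like the sketch below. The names INDEL and collect_indels are invented for illustration, and the sign is made mandatory here (unlike the optional sign above) on the assumption that every indel carries one.

```python
import re

# Sign, integer, then as many base letters as follow (trimmed in the callback).
INDEL = re.compile(r'([+-])([0-9]+)([ACGTNacgtn]+)')

def collect_indels(text):
    """Return the indels in text, each trimmed to its stated length."""
    found = []

    def trim(match):
        sign, digits, bases = match.groups()
        # Keep only the first n bases; the regex matched all of them greedily.
        found.append(sign + digits + bases[:int(digits)])
        # Return the full match so the substitution changes nothing.
        return match.group(0)

    INDEL.sub(trim, text)
    return found

print(collect_indels('AtT+3ACGTTT-1AaTTa'))
# ['+3ACG', '-1A']
```

If fewer than n bases follow the integer, this sketch silently returns the shorter string rather than rejecting the match.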
Michael Low
  • Thanks mikel, your answer got me thinking about how to solve this problem in python. I've updated my question with the solution that I've found. – Aaron Sams Jul 28 '12 at 14:14

The simple Perl pattern for matching an integer followed by that number of any character is just:

 (\d+)(??{"." x $1})

which is quite straightforward, I think you’ll agree. For example, this snippet:

my $string = "AtT+3ACGTTT-1AaTTa";

print "Matched $&\n" while $string =~ m{
    ( \d+ )            # capture an integer into $1
    (??{ "." x $1 })   # interpolate that many dots back into pattern
}xg;

merrily prints out the expected

Matched 3ACG
Matched 1A
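Python's standard re module has no counterpart to Perl's (??{ ... }) construct, but the idea of interpolating the captured count back into a pattern can be approximated in two stages. The following is a rough sketch under that assumption; HEADER and iter_indels are names made up here, and unlike the Perl pattern above it requires a leading sign and restricts the counted characters to ACGTN in either case.

```python
import re

HEADER = re.compile(r'[+-](\d+)')   # stage 1: the sign and the integer

def iter_indels(text):
    """Yield each indel whose integer is followed by exactly that many bases."""
    pos = 0
    while True:
        head = HEADER.search(text, pos)
        if head is None:
            return
        n = int(head.group(1))
        # Stage 2: build a pattern with the count interpolated, anchored
        # at the position where the integer ends.
        body = re.compile(r'[ACGTNacgtn]{%d}' % n).match(text, head.end())
        if body is None:
            pos = head.end()        # not enough bases: skip this header
            continue
        yield text[head.start():body.end()]
        pos = body.end()

print(list(iter_indels('AtT+3ACGTTT-1AaTTa')))
# ['+3ACG', '-1A']
```

Because the second stage is a real pattern match, a header such as +5 followed by only two bases is rejected instead of being returned short.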

EDIT

Oh drat, I see you just added the Python tag since I began editing. Oops. Well, maybe this will be helpful to you anyway.

That said, if what you are actually looking for is fuzzy matching where you allow for some number of insertions and deletions (the edit distance), then Matthew Barnett’s regex library for Python will handle that. That doesn’t seem to be quite what you’re doing, as the insertions and deletions are actually represented in your strings.

But Matthew’s library is really very good and very interesting, and it even does many things that Perl cannot do. :) It’s a drop-in replacement for the standard Python re library.

tchrist