Regular Expression - how to capture a number of characters specified in the string

Question

I'm trying to use regular expressions to extract a digit as well as a number of characters equal to that digit from a string. This is for analyzing a pileup summary output from samtools mpileup. I'm doing this in Python.

As an example, let's say I have the following string:

.....+3AAAT.....

I am trying to extract the +3AAA from the string, leaving us with:

.....T.....

Note that the T remains, because I only wanted to extract 3 characters (because the string indicated that 3 should be extracted).

I could do the following:

re.sub(r"\+[0-9]+[ACGTNacgtn]+", "", ".....+3AAAT.....")

But this would cut out the T as well, leaving us with:

..........

Is there a way to use information in the string itself to adjust the pattern in a regular expression? I could work around this without regular expressions, but if regular expressions can do it, I'd rather use them.

What happens if the number is 0 or negative? Should the second half of the string be cut if the number is greater than the middle part's length? — InSync, Apr 27 '23 at 14:41
In this part, [ACGTNacgtn], there is a t. Maybe you could make your regex case-sensitive? — ImBadAtMath, Apr 27 '23 at 14:42
The '+' symbol could alternatively be '-', but the principle remains the same - if there's a number, I want to extract the characters after that number. This format would not have a 0, though theoretically if you saw one you would simply extract the digit, and none of the following characters. — CCranney, Apr 27 '23 at 14:51

score 1 · Accepted Answer · answered Apr 27 '23 at 14:55

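You can pass a function (here a lambda) as the replacement argument of re.sub(). It is called with each match object and returns the replacement string, so it can slice off exactly as many characters as the captured number says: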

import re

def replace(string):
  replaced = re.sub(
    r'\+([0-9]+)([ACGTNacgtn]+)',
    # group(1) = '3', group(2) = 'AAAT'
    lambda match: match.group(2)[int(match.group(1)):],
    string
  )
  return replaced

Try it:

string = '.....+3AAAT.....'
print(replace(string))  # '.....T.....'

string = '.....+10AAACCCGGGGTN.....'
print(replace(string))  # '.....TN.....'

string = '.....+0AN.....'
print(replace(string))  # '.....AN.....'

string = '.....+5CAGN.....'
print(replace(string))  # '..........'
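
The asker's comment notes that the sign may be '-' as well as '+'. A minimal sketch of the same callback idea covering both signs follows. The name strip_indels is hypothetical, and the pattern assumes the count is always followed by ACGTN letters in either case; it is an illustration, not a full pileup parser.

```python
import re

# Hypothetical helper; assumes indels look like +3AAA or -2ct.
# '*' rather than '+' so that a bare '+0' with no bases after it is also removed.
INDEL = re.compile(r'[+-]([0-9]+)([ACGTNacgtn]*)')

def strip_indels(pileup):
    # Drop the sign, the count, and the first `count` bases; keep any remaining bases.
    return INDEL.sub(lambda m: m.group(2)[int(m.group(1)):], pileup)

print(strip_indels('.....+3AAAT.....'))  # '.....T.....'
print(strip_indels('.....-2ctG.....'))   # '.....G.....'
```

As in the answer above, a count larger than the number of available bases simply removes all of them.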

score 0 · Answer 2 · answered Apr 27 '23 at 14:54

There is an (ill-advised) purely regex-based solution, matching each possible number separately:

import re

MAX_NUMBER = 10

regex = re.compile(
    r"\+(?:" + "|".join(f"{d}[acgtn]{{{d}}}" for d in range(MAX_NUMBER)) + ")",
    flags=re.IGNORECASE,
)
print(regex.sub("", ".....+3AAAT....."))  # '.....T.....'

With MAX_NUMBER = 10, the compiled regex is equivalent to the following monster:

\+(?:0[acgtn]{0}|1[acgtn]{1}|2[acgtn]{2}|3[acgtn]{3}|4[acgtn]{4}|5[acgtn]{5}|6[acgtn]{6}|7[acgtn]{7}|8[acgtn]{8}|9[acgtn]{9})

({0} and {1} are a bit silly, but it may not be worth the effort to fix them.)
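
If the {0} and {1} quantifiers bother you, one possible way to drop them is to special-case those two alternatives while building the pattern. This is only a sketch: build_indel_regex and its max_number parameter are hypothetical names, and counts at or above max_number are still left untouched.

```python
import re

def build_indel_regex(max_number=10):
    # Hypothetical helper: one alternative per count from 0 to max_number - 1.
    def alternative(d):
        if d == 0:
            return "0"            # nothing follows a zero count
        if d == 1:
            return "1[acgtn]"     # a single base needs no quantifier
        return f"{d}[acgtn]{{{d}}}"

    body = "|".join(alternative(d) for d in range(max_number))
    return re.compile(r"\+(?:" + body + ")", flags=re.IGNORECASE)

regex = build_indel_regex(11)  # covers counts 0 through 10
print(regex.sub("", ".....+3AAAT....."))           # '.....T.....'
print(regex.sub("", ".....+10AAACCCGGGGTN....."))  # '.....TN.....'
```

Unlike the callback approach, this pattern does not match at all when fewer bases follow than the count demands, so such input is returned unchanged.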
