I am trying to parse the text output from samtools mpileup. I start with a string:

s = '.$......+2AG.+2AG.+2AGGG'

Whenever I have a + followed by an integer n, I would like to select the n characters following that integer and replace the whole thing (the +, the integer, and those n characters) with *. So for this test case I would have

'.$......+2AG.+2AG.+2AGGG' ---> '.$......*.*.*GG' 

I have the regex \+[0-9]+[ACGTNacgtn]+ but it matches every base after the integer, not just n of them, so the output is .$......*.*.* and the trailing G's are lost as well. How do I select n characters when n is not known ahead of time but is specified in the string itself?

vk673

2 Answers

The repl argument in re.sub can be a string or a function. When it is a function, it is called with each match object and its return value is used as the replacement, so the replacement can depend on what was matched:

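import re

def removechars(m):
    x = m.group()                          # whole match, e.g. '+2AGGG'
    n = re.match(r'\+(\d+)', x).group(1)   # digit part, as a string
    # Skip '+', the digits, and the n inserted bases; keep whatever follows.
    return '*' + x[1 + len(n) + int(n):]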

This solves your problem:

>>> re.sub(r'\+[0-9]+[ACGTNacgtn]+', removechars, s)
'.$......*.*.*GG'
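
A slightly more compact variant of the same idea, as a sketch: capturing the count and the run of bases in separate groups means the callback does not need a second regex. The name collapse_insertions is made up for this example, and it assumes the inserted bases are always drawn from ACGTN in either case.

```python
import re

def collapse_insertions(pileup):
    """Replace each '+n' and the n bases after it with '*'."""
    def repl(m):
        n = int(m.group(1))      # declared insertion length
        bases = m.group(2)       # greedy run of bases after the count
        return '*' + bases[n:]   # drop the n inserted bases, keep the rest
    return re.sub(r'\+(\d+)([ACGTNacgtn]+)', repl, pileup)

s = '.$......+2AG.+2AG.+2AGGG'
print(collapse_insertions(s))
```

Note that if fewer than n bases follow the count, the slice is simply empty, so malformed input is not reported.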
fferri

Not the most elegant, but I pulled out the numeric values using re.findall before running re.sub.

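import re

# Collect each distinct insertion length (one or more digits after '+').
ls = set(re.findall(r'\+(\d+)', s))

for i in ls:
    # Replace '+n' followed by exactly n bases with '*'.
    s = re.sub(r'\+%s[ACGTNacgtn]{%s}' % (i, i), '*', s)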
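
If repeated re.sub passes feel clumsy, the count can also be read directly while scanning the string once. This is only a sketch of a different technique (a manual left-to-right scan, no regex); mask_insertions is a hypothetical name, and it assumes every + followed by digits marks an insertion.

```python
def mask_insertions(pileup):
    """Scan left to right, replacing '+n' plus the next n characters with '*'."""
    out = []
    i = 0
    while i < len(pileup):
        if pileup[i] == '+':
            j = i + 1
            # Advance j past the run of ASCII digits that follows '+'.
            while j < len(pileup) and '0' <= pileup[j] <= '9':
                j += 1
            if j > i + 1:                 # at least one digit followed '+'
                n = int(pileup[i + 1:j])
                out.append('*')
                i = j + n                 # skip the count and the n characters
                continue
        out.append(pileup[i])             # ordinary character, copy as-is
        i += 1
    return ''.join(out)

print(mask_insertions('.$......+2AG.+2AG.+2AGGG'))
```

Unlike the regex versions, this skips exactly n characters whatever they are, so it does not check that they are valid bases.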