
I want to replace the text of matched re patterns in a string and can do this using re.sub(). If I pass it a function as the repl argument in the call it works as desired, as illustrated below:

from __future__ import print_function
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'

my_str = "Here's some <first>sample stuff</first> in the " \
            "<second>middle</second> of some other text."

def replace(m):
    return ''.join(map(lambda v: v if v else '',
                        map(m.group, ('text', 'content'))))

cleaned = re.sub(pattern, replace, my_str)
print('cleaned: {!r}'.format(cleaned))

Output:

cleaned: "Here's some sample stuff in the middle of some other text."

However, from the documentation it sounds like I should be able to get the same result by just passing a replacement string that references the named groups. My attempt to do that didn't work, because sometimes a group is unmatched and the value returned for it is None (rather than an empty string '').

cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
print('cleaned: {!r}'.format(cleaned))

Output:

Traceback (most recent call last):
  File "test_resub.py", line 21, in <module>
    cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str)
  File "C:\Python\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Python\lib\re.py", line 278, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python\lib\sre_parse.py", line 802, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

What am I doing wrong or not understanding?

martineau
  • The `content` of the last match is None... – kennytm Dec 23 '14 at 22:03
  • @KennyTM: I know some of the match groups will be `None`, which is why I use the `lambda v: v if v else ''` in the `replace()` function. Is something like that needed in the replacement string and, if so, how is it done? – martineau Dec 24 '14 at 00:12

2 Answers

import re

def repl(matchobj):
    # Group 3 ('content') is None when the tag alternative did not match.
    if matchobj.group(3):
        return matchobj.group(1) + matchobj.group(3)
    else:
        return matchobj.group(1)

my_str = "Here's some <first>sample stuff</first> in the " \
         "<second>middle</second> of some other text."

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
print(re.sub(pattern, repl, my_str))

You can pass a function as the repl argument of re.sub.

Edit: cleaned = re.sub(pattern, r'\g<text>\g<content>', my_str) will not work because, when the last part of the string (" of some other text.") is matched, \g<text> is defined but \g<content> is not: that match ends at $ and there is no content. You still ask re.sub to expand \g<content>, so it raises the error. If you use the string "Here's some <first>sample stuff</first> in the <second>middle</second>" instead, then print(re.sub(pattern, r"\g<text>\g<content>", my_str)) will work, as \g<content> is defined in every match.
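For what it's worth, the re documentation notes that since Python 3.5 unmatched groups in a replacement template are replaced with an empty string, so the template form from the question no longer raises there. Below is a minimal sketch, not a definitive implementation, that tries the template first and falls back to a function on older versions. The names keep_text_and_content and clean are hypothetical helpers made up for this illustration.

```python
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'
my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")


def keep_text_and_content(m):
    # Hypothetical helper: an unmatched group comes back as None,
    # so fall back to '' before concatenating.
    return (m.group('text') or '') + (m.group('content') or '')


def clean(s):
    # Hypothetical wrapper: prefer the template, fall back to the function.
    try:
        # On Python 3.5+ an unmatched group expands to ''.
        return re.sub(pattern, r'\g<text>\g<content>', s)
    except re.error:
        # Older versions raise "unmatched group" here.
        return re.sub(pattern, keep_text_and_content, s)


cleaned = clean(my_str)
print('cleaned: {!r}'.format(cleaned))
```

The function form stays the portable choice if the code has to run on versions older than 3.5.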

vks
  • I know you can pass a function to `re.sub()` -- that's what the first bit of code in my question does. I'd like to know how to do it by passing a replacement string containing references to the named (or numbered) groups. – martineau Dec 24 '14 at 08:53
  • So you're saying there's no way to handle a group in the pattern that wasn't matched (even though the pattern as a whole matched, since it allows zero or more occurrences of the group) except by passing `re.sub()` a function? I was hoping that wasn't true and there was some conditional form of referencing a named capturing group. – martineau Dec 24 '14 at 09:18
  • @martineau That's why we can use a function there for this type of situation. – vks Dec 24 '14 at 09:19
  • Sounds like a major shortcoming to me -- and allowing the user to supply a function is just a way to work around it. I'm tempted to accept your edited answer, but will wait a while first to see if there are any others. – martineau Dec 24 '14 at 09:27
  • Nope. There's a PyPI module named `regex` that gives such groups the value `''` instead of `None` -- like Perl and PCRE do -- unfortunately Python's `re` module doesn't have a flag for that... guess I have to use the function version of the argument. – martineau Dec 30 '14 at 17:05

If I understand correctly, you want to remove everything between < > inclusive:

>>> import re
>>> my_str = "Here's some <first>sample stuff</first> in the <second>middle</second> of some other text."
>>> print(re.sub(r'<.*?>', '', my_str))
Here's some sample stuff in the middle of some other text.

To explain what's going on in r'<.*?>':

< matches the opening <

. matches any character (except a newline)

* repeats the preceding . any number of times, including zero

? makes that repetition non-greedy, so the match is as short as possible; without it, the match would run to the last > instead of the first available one

> matches the closing >

Then everything from < to > inclusive is replaced with nothing.
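To see what the ? changes, here is a small sketch that runs both the non-greedy and the greedy form of the pattern on the string from the question. The variable names lazy and greedy are just illustrative.

```python
import re

my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")

# Non-greedy: each match stops at the first '>' after a '<',
# so only the tags themselves are removed.
lazy = re.sub(r'<.*?>', '', my_str)

# Greedy: a single match runs from the first '<' to the last '>',
# so the text between the tags is removed as well.
greedy = re.sub(r'<.*>', '', my_str)

print(lazy)
print(greedy)
```

The greedy version swallows "sample stuff", " in the " and "middle" along with the tags, which is why the ? matters here.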

MrAlexBailey
  • I want to remove, as in replace with nothing, all `<tag>`s and corresponding `</tag>`s, but keep everything else, including anything that's between them. – martineau Dec 24 '14 at 00:14
  • My sub suggestion should do exactly that... If it's missing something, post the full string you are working with so we can help further. – MrAlexBailey Dec 24 '14 at 00:50
  • The entire string is in my question, it's in the `my_str` variable. I have named capturing groups in my regex and would like to reference them in a replacement string passed to `re.sub()`. Your sub suggestion isn't doing that so doesn't seem helpful because it's not answering the question I asked which is how to reference them in a replacement string without error. – martineau Dec 24 '14 at 01:46