python regex: either or or both with separator

Question

I need a regular expression to match a, b or a;b.

I cannot write a|b|a;b because a and b contain named groups and if I try to do this I get an Exception: redefinition of group name 'a' as group 8; was group 3 at position 60.

a;?b does not work either because ab must not be matched.

How would you solve this? Is this possible with the re library? I have heard there is also a library called pyparsing. Would that be better suited for this problem?

Background: This is a follow-up question to this one. Because it does not seem to be possible to pass color codes through urwid or curses, I am trying to decode the color codes I get from git so that urwid can re-encode them.

To avoid problems with copy & paste I am leaving out the leading control character in the following regular expressions:

Working regex, except that it does not match [1m (bold) which is used in a test program:

reo_color_code = re.compile(
    r'\['
    r'((?P<series>[01]);)?'
    r'((?P<fgbg>[34])(?P<color>[0-7]))?'
    r'm'
)

Regex that does not compile:

reo_color_code = re.compile(
    r'\['
    r'('
        r'((?P<series>[01]))'
        r'|'
        r'((?P<fgbg>[34])(?P<color>[0-7]))'
        r'|'
        r'((?P<series>[01]));((?P<fgbg>[34])(?P<color>[0-7]))'
    r')'
    r'm'
)

It throws the exception:

re.error: redefinition of group name 'series' as group 8; was group 3 at position 60

You have a serious error in the question. The substitution does not work in regexes. Probably due to that misunderstanding, you cannot write the code. What is **a**? What is **b**? Your example OBVIOUSLY is not about a/b/a;b. And we don't need the context where you take from these **a** and **b**. — Gangnus, Dec 25 '19 at 19:29
@Gangnus sorry, but I don't understand what's wrong with the simplification to a and b. I explicitly stated the one thing that is causing the problem, namely that both contain named groups. What other details are important? — jakun, Dec 26 '19 at 07:31
The substitution works if it does not change the sense. Your substitution does change it - it causes the error. You CAN explain, using it, you CAN use a, b instead some strings, but you MUST define these a, b, naming all details that can cause problems. Not mention slightly, but define them. I wanted to help you, and I have answered many questions on regex, but I could not understand what you are talking about. — Gangnus, Dec 26 '19 at 21:10

dcg · Accepted Answer · 2019-12-25T20:21:58.373

In this case I would not try to build a single regex that solves the entire problem. Instead, I would implement a function like the following (still using re, but in separate steps):

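import re

def get_info(s):
    # Expect the form '[...m' (leading control character omitted).
    if s.startswith('[') and s.endswith('m'):
        p = s[1:-1]
        if ';' in p:
            # Both parts present: series;fgbg+color
            m = re.match(r'^([01]);([34])([0-7])$', p)
        else:
            # Only one part: either series alone, or fgbg+color alone
            m = re.match(r'^([01])$|^([34])([0-7])$', p)
        if m:
            return m.groups()
    # No match: all three fields are None
    return None, None, None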

You can use it like:

>>> serie, fgbg, color = get_info('[1;37m')
>>> serie, fgbg, color
('1', '3', '7')

PS: Didn't do too many tests. Hope it helps.
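If a single pattern is still preferred, one possible approach is to define each named group only once and let lookaheads enforce the "a, b or a;b" rule. The following is only a sketch under that assumption: the group names come from the question, while the helper name parse_color_code is hypothetical and the leading control character is left out as in the question.

```python
import re

# Each named group is defined exactly once, so re does not raise a
# "redefinition of group name" error.
reo_color_code = re.compile(
    r'\['
    r'(?=[^m])'                                  # at least one part must be present
    r'(?:(?P<series>[01])(?:;(?=[34])|(?=m)))?'  # series, followed by ';'+fgbg or by 'm'
    r'(?:(?P<fgbg>[34])(?P<color>[0-7]))?'       # optional fgbg + color
    r'm'
)

def parse_color_code(s):
    """Hypothetical helper: return the named groups as a dict, or None."""
    m = reo_color_code.fullmatch(s)
    return m.groupdict() if m else None

print(parse_color_code('[1m'))
print(parse_color_code('[37m'))
print(parse_color_code('[1;37m'))
print(parse_color_code('[137m'))   # the "ab" case, which must not match
```

The lookahead after series accepts a semicolon only when a foreground/background digit follows, and otherwise requires the closing m, so inputs such as [137m, [1;m and [m are rejected.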

score 1 · Answer 2 · answered Dec 27 '19 at 14:46

Here is a more general regex for cracking ANSI terminal sequences:

\[(\d+)(?:;(\d+))?([a-z])

If you want to access the groups by name, then use this:

\[(?P<d1>\d+)(?:;(?P<d2>\d+))?(?P<trailing>[a-z])

I didn't give the integer values any meaningful names, since they can vary depending on the trailing alpha character (and can also be >1 digit long).
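As a rough illustration, the named-group pattern could be used from Python as shown below. The helper name parse_ansi is hypothetical, the conversion to int is an added convenience, and the leading control character is omitted as in the question.

```python
import re

# Named-group pattern from above; d2 is optional.
reo_ansi = re.compile(r'\[(?P<d1>\d+)(?:;(?P<d2>\d+))?(?P<trailing>[a-z])')

def parse_ansi(s):
    """Hypothetical helper: return (d1, d2, trailing), or None if there is no match."""
    m = reo_ansi.match(s)
    if m is None:
        return None
    d2 = m.group('d2')
    # d2 is None when the sequence has only one number
    return int(m.group('d1')), (int(d2) if d2 is not None else None), m.group('trailing')

print(parse_ansi('[1m'))
print(parse_ansi('[1;37m'))
```

Interpreting d1 and d2 (for example, splitting 37 into foreground and color 7) is left to the caller, since their meaning depends on the trailing character.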

For future regex development work, https://regex101.com is a great interactive page for working through the re kinks.

score 1 · Answer 3 · answered Dec 27 '19 at 15:05

Since you asked about pyparsing, here is what a pyparsing parser would look like:

import pyparsing as pp

integer = pp.pyparsing_common.integer
ansi_expr = ("[" 
             + integer("d1") 
             + pp.Optional(';' + integer("d2")) 
             + pp.oneOf(list(pp.alphas.lower()))("trailing"))

ansi_expr.runTests("""\
    [1m
    [23;34z
    """)

With test output:

[1m
['[', 1, 'm']
- d1: 1
- trailing: 'm'

[23;34z
['[', 23, ';', 34, 'z']
- d1: 23
- d2: 34
- trailing: 'z'
