
I have a series of regex patterns that get grouped into categories. I'm trying to statically compile them into alternations, but I don't want any special meanings to get lost.

As an example, I'm identifying Raspberry Pi GPIO pins by name. There are the GPIO pins, 0-27 (coincidentally the same numbers as the BCM nomenclature), the voltage reference pins, and the named-function pins. Depending on which category a particular physical pin falls into, assumptions can be made; for example, a voltage-reference pin has neither a pull-up status nor a GPIO/BCM number.

So:

_cats = {
    'data': (
        r'gpio\.?([0-9]|[1-3][0-9]|40)',
    ),
    'vref': (
        r'v3_3',
        r'v5',
        r'gnd',
    ),
    'named': (
        r'SDA\.?([01])',
        r'CE\.?0',
        r'CE\.?1',
    ),
}

The first thing I want to do is combine all of the patterns into a single compiled alternation so I can check whether an input actually matches any of my keys. For the single pattern in the 'data' category, I could simply use:

crx = re.compile(r'\A' + _cats['data'][0] + r'\Z', re.IGNORECASE)

For all of them, I could do something like:

crx = re.compile('|'.join([(r'\A' + rx + r'\Z') for rx in _cats['vref']]), re.IGNORECASE)

but this is starting to confuse me. Each regex term should be ^$ or \A\Z bounded, but joining them into alternations and then compiling them is giving me issues.

I'm looking for something like Emacs' regexp-opt function.

I've tried variations on the theme described, and I'm getting syntax errors, patterns that don't match anything, and patterns that match too much.

Edit

Thanks for the comments which helped clarify and solve my main question, but I think the second part got lost somewhere. Specifically,

Is a compiled regex itself a regular expression, or is it a sort of opaque end-point? Would this (p-codish) work?

rx_a = re.compile(r'(?:a|1|#)')
rx_b = re.compile(r'(?:[b-z]|[2-9]|@)')
rx_c = re.compile('|'.join([repr(rx_a), repr(rx_b)]))

Or something of the sort?
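
For reference, a compiled pattern object keeps its source text in its `.pattern` attribute, whereas `repr()` returns a `re.compile(...)` description and not the bare pattern text. A minimal sketch of combining two compiled patterns that way (the flags of `rx_a` and `rx_b` are not carried over here; they would have to be passed again if needed):

```python
import re

rx_a = re.compile(r'(?:a|1|#)')
rx_b = re.compile(r'(?:[b-z]|[2-9]|@)')

# Join the source strings of the compiled patterns and compile the result.
rx_c = re.compile('|'.join([rx_a.pattern, rx_b.pattern]))

print(rx_c.pattern)
print(rx_c.fullmatch('q') is not None)
```

Each source pattern is already wrapped in `(?:...)`, so the joined alternation keeps the two halves separate.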

RoUS
  • The original question pointed out that I was using Python, but it was immediately elided by a moderator. So, Python. And the backticks were accidentally inserted for formatting, not semantics. – RoUS Aug 07 '23 at 16:15
  • Thanks, spotted the errors and fixed. – RoUS Aug 07 '23 at 16:21
  • It's not necessary to say the language in the title, it's in the tags. – Barmar Aug 07 '23 at 16:25
  • Noted for future reference and behavioural correction, although it seems to cater to people who update multiple fields rather than simply looking for "doing X in Python." {shrug} – RoUS Aug 07 '23 at 16:28
  • They should search for `[python] doing X` – Barmar Aug 07 '23 at 16:30
  • Anyway, what is your question? Now that you fixed the typo, your code looks good. – Barmar Aug 07 '23 at 16:35
  • It would be best to put `(?:...)` around each alternative, in case it contains `|` – Barmar Aug 07 '23 at 16:36
  • Actually, the groups within regex items, such as `gpio\.?([0-9]|[1-3][0-9]|40)` are actually there in order to be able to extract those portions of the match. Are those the ones you meant? – RoUS Aug 07 '23 at 16:39
  • No, I meant when you're joining them together, add `(?:` and `)` around each `rx`. `'\A(?:' + rx + ')\Z'` – Barmar Aug 07 '23 at 16:41
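
The suggestion in the last comment might look like the following minimal sketch. It assumes the `_cats` table from the question with a comma after every pattern; the helper name `compile_category` and the `crx_by_cat` dict are hypothetical, introduced only for illustration.

```python
import re

_cats = {
    'data': (r'gpio\.?([0-9]|[1-3][0-9]|40)',),
    'vref': (r'v3_3', r'v5', r'gnd'),
    'named': (r'SDA\.?([01])', r'CE\.?0', r'CE\.?1'),
}

def compile_category(patterns):
    """Join patterns into one anchored, case-insensitive alternation (hypothetical helper)."""
    # Wrap each pattern in (?:...) so any '|' inside it stays local to that pattern.
    body = '|'.join(r'\A(?:' + p + r')\Z' for p in patterns)
    return re.compile(body, re.IGNORECASE)

# One compiled alternation per category.
crx_by_cat = {name: compile_category(pats) for name, pats in _cats.items()}

print(bool(crx_by_cat['vref'].search('GND')))
print(bool(crx_by_cat['data'].search('gpio.17')))
print(bool(crx_by_cat['data'].search('gpio.170')))
```

Without the `(?:...)` wrapper, a pattern that contains a top-level `|` would have only its first alternative bound to `\A` and only its last bound to `\Z`, which is one way to end up with patterns that match too much.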

1 Answer


I believe you can achieve all of your requirements with regex named groups.

The syntax is: (?P<NAME>EXPRESSION).

import re

#expressions
DATA  = r'(?P<data>gpio\.?([0-9]|[1-3][0-9]|40))'
VREF  = r'(?P<vref>v3_3|v5|gnd)'
NAMED = r'(?P<named>SDA\.?([01])|CE\.?[01])'

#compiled expression
search = re.compile(fr'{DATA}|{VREF}|{NAMED}').search

#find (YourData is a placeholder for the input string, e.g. 'gpio.17')
if m := search(YourData):
    print(m.group('data'))
    print(m.group('vref'))
    print(m.group('named'))
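
Only the group for the matching category holds text; the other named groups come back as `None`. A minimal sketch of turning that into a category lookup follows. The names `PIN_RX` and `categorize` are hypothetical, the inner groups are made non-capturing here for simplicity, and `fullmatch` with `re.IGNORECASE` is used instead of `search` on the assumption that the whole input should be a pin name:

```python
import re

DATA  = r'(?P<data>gpio\.?(?:[0-9]|[1-3][0-9]|40))'
VREF  = r'(?P<vref>v3_3|v5|gnd)'
NAMED = r'(?P<named>SDA\.?[01]|CE\.?[01])'

PIN_RX = re.compile(fr'{DATA}|{VREF}|{NAMED}', re.IGNORECASE)

def categorize(name):
    """Return (category, matched_text), or None if the name matches no category."""
    m = PIN_RX.fullmatch(name)
    if m is None:
        return None
    # Only the named group of the alternative that matched is non-None.
    for category, text in m.groupdict().items():
        if text is not None:
            return category, text
    return None

print(categorize('GPIO.17'))
print(categorize('gnd'))
print(categorize('ce1'))
print(categorize('gpio.99'))
```

With plain `search` and no anchors, an input such as `gpio17` could match only its `gpio1` prefix, because the first alternative `[0-9]` succeeds before the two-digit one is tried.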
OneMadGypsy