
I'm using a regex in a Python script to capture a named group. The group occurs either before or after a delimiter string "_S_". My confusion comes from being unable to use the same group name twice in one regex.

I'd like to use the following invalid (named group used twice) regex:

(?:^STD_S_)(?P<important>.+?)$|(?:^(?P<important>.+?)(?:_S_STD)$

Description:

(?:^STD_S_) non-capturing group: the input starts with "STD_S_", a standard string plus a delimiter

(?P<important>.+?) the named "important" string I want

| OR

^(?P<important>.+?) the input starts with the important string, and (?:_S_STD)$ it ends with the delimiter plus the standard string

I would really like the important group I capture to be named. I can remove the names and get this to work. I can also split the single expression into two expressions (one for each side of the OR) and choose which one to use with some logic in the Python script.

Thanks!

EXAMPLE INPUTS

STD_S_important
important_S_STD

EXAMPLE OUTPUTS

important #returned by calling the important named group
important

Regex based on the comments, which doesn't match the second case:

(?:(?:^STD_S_)(?P<important>.+?)$)|(?:^(?P=important)(?:_S_STD)$)
outis
twinturbotom
  • Umm... you can reference a previous capture group... `(?P=important)` for the second one - not sure if that'd work in your example... Bit hard to test as you don't provide some sample input that stuff can be run on... – Jon Clements Aug 19 '16 at 01:31
  • I believe that either a) checking two regexes or b) first checking whether STD_S occurs at the beginning or end and then choosing the appropriate regex will be much easier – Victor Chubukov Aug 19 '16 at 01:38
  • That seems like it would be the solution but my first attempt failed. I'll update my question with a sample after a few more tests. – twinturbotom Aug 19 '16 at 01:39
  • Victor. The thought of using python to search a string for a condition with string.startswith() to select which regex to search seems redundant.... but it is a solution. – twinturbotom Aug 19 '16 at 01:42
  • Shrug. All up to you, but I think it will be the most readable, and I don't see why it would be any less efficient. – Victor Chubukov Aug 19 '16 at 01:52
  • @twinturbotom is the example inputs only showing the use of STD_S (which your original used) as you now appear to have introduced STD_S_ for the start and _S_STD for the end... – Jon Clements Aug 19 '16 at 01:54
  • Use PyPi regex module with [`^(?:STD_S_(?P<important>.+)|(?P<important>.+?)_S_STD)$`](https://regex101.com/r/kR5xZ2/1). – Wiktor Stribiżew Aug 19 '16 at 21:42

1 Answer

Note the general form of the regex is: A(?P<name>B)|(?P<name>B)C. Since Python's re module doesn't allow a group name to be repeated, the named group must go around the whole alternation. This causes another issue: the prefix and suffix are captured in the named group. To resolve this, you can use lookarounds, which check for the prefix and suffix without including them in the group.

(?P<name>(?<=A)B|B(?=C))

Note that this only works when the prefix is of fixed length, because Python's re module requires a lookbehind to match strings of a fixed length. If part of the prefix or suffix should itself be captured, you can add capturing groups inside the lookarounds. Anchors cannot be placed next to the lookarounds but must instead be put inside them, else they will create mutually exclusive requirements.

# can succeed:
(?P<name>(?<=^A)B$|^B(?=C$))

# always fails:
(?P<name>^(?<=A)B$|^B(?=C)$)
^(?P<name>(?<=A)B|B(?=C))$
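
The difference can be checked with a minimal sketch using Python's built-in re module. It treats the placeholders A, B, and C as literal characters; the variable names are illustrative only.

```python
import re

# Anchors inside the lookarounds: either alternative can match.
ok = re.compile(r"(?P<name>(?<=^A)B$|^B(?=C$))")

# Anchors next to the lookarounds: ^ demands the start of the string
# where the lookbehind demands a preceding A, and $ demands the end
# of the string where the lookahead demands a following C.
bad_inner = re.compile(r"(?P<name>^(?<=A)B$|^B(?=C)$)")
bad_outer = re.compile(r"^(?P<name>(?<=A)B|B(?=C))$")

for subject in ("AB", "BC"):
    # The first pattern captures only B; the other two find no match.
    print(subject, ok.search(subject).group("name"))
    print(subject, bad_inner.search(subject), bad_outer.search(subject))
```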

For the regex in question, this gives:

(?P<important>(?<=^STD_S_).+$|^.+(?=_S_STD$))

(RegEx101 demo)
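
As a rough sketch of how this pattern might be used from a script with the built-in re module (the helper name extract_important is made up for illustration):

```python
import re

IMPORTANT = re.compile(r"(?P<important>(?<=^STD_S_).+$|^.+(?=_S_STD$))")


def extract_important(text):
    """Return the 'important' group, or None if neither form matches."""
    # Use search(), not match(): in the prefix case the match itself
    # begins at index 6, after the lookbehind has verified "STD_S_".
    m = IMPORTANT.search(text)
    return m.group("important") if m else None


print(extract_important("STD_S_important"))
print(extract_important("important_S_STD"))
```

With re.match the first alternative could never succeed, because match() only tries position 0, where the lookbehind has nothing to look back at.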

Alternatively, the regex module allows the same group name to be used for multiple groups, with the last capture taking precedence.
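
A minimal sketch of that approach, assuming the third-party regex package from PyPI is installed (it is not part of the standard library). The pattern is the one suggested in the comments on the question; the helper name is hypothetical.

```python
try:
    import regex  # third-party package from PyPI, not the stdlib re
except ImportError:
    regex = None

# Both alternatives use the same group name, which stdlib re rejects.
PATTERN = r"^(?:STD_S_(?P<important>.+)|(?P<important>.+?)_S_STD)$"


def extract_important_regex(text):
    """Return the 'important' group using the regex module, or None."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    m = regex.search(PATTERN, text)
    return m.group("important") if m else None
```

No lookarounds are needed here, so the fixed-length restriction on the prefix no longer applies.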

outis