Let's say I have a string that can contain only A's, B's and C's.

I have substrings of a certain pattern that I want to extract: they start with ABC, continue with a combination of B's and C's, and end with CBA.

The naive solution is to use ABC[BC]*CBA.

However, that will not match the string ABCBA, where a single C belongs to both the opening ABC and the closing CBA. Is there a "pythonic" way to address this, other than using | to combine two separate regexes?
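
To make the problem concrete, here is a small sketch using Python's standard `re` module. The sample strings are made up for illustration and are not from the original question.

```python
import re

# The naive pattern from the question: it needs a C after AB and a
# separate C before BA, so the shortest string it accepts has 6 characters.
naive = re.compile(r"ABC[BC]*CBA")

# Hypothetical sample strings, chosen only to illustrate the gap.
samples = ["ABCBCBA", "ABCCBA", "ABCBA"]
naive_results = {s: naive.fullmatch(s) is not None for s in samples}

for s, ok in naive_results.items():
    print(s, ok)
```

The first two samples match, while ABCBA does not, because its single C cannot be consumed by both ABC and CBA.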

Wiktor Stribiżew
MattS
  • Matt, you do not need to use lookarounds here, your regex is fine, the only fix to be done is to enclose the `C[BC]*` part with an optional group. – Wiktor Stribiżew May 29 '18 at 11:32
  • @WiktorStribiżew Thank you very much for your solution! Are optional groups more preferable than lookaround? – MattS May 29 '18 at 15:14
  • You may compare the steps: [34 (mine)](https://regex101.com/r/50I7oM/3) vs. [42 (Biffen's)](https://regex101.com/r/6UNzCc/1). – Wiktor Stribiżew May 30 '18 at 08:09

2 Answers

You can use lookarounds:

AB(?=C)[BC]*(?<=C)BA

I.e. make sure AB is followed by C and BA is preceded by C, even if they are the same C.
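
A minimal sketch of how this lookaround pattern behaves in Python's `re` module follows. The test strings are invented for illustration; only the pattern itself comes from the answer.

```python
import re

# AB must be followed by a C (lookahead), and BA must be preceded by a C
# (lookbehind). Neither assertion consumes input, so one C can satisfy both.
lookaround = re.compile(r"AB(?=C)[BC]*(?<=C)BA")

# Hypothetical sample strings.
samples = ["ABCBA", "ABCCBA", "ABCBCBA", "ABCBBA", "ABBA"]
lookaround_results = {s: lookaround.fullmatch(s) is not None for s in samples}

for s, ok in lookaround_results.items():
    print(s, ok)
```

ABCBBA is rejected because the final BA is preceded by B rather than C, and ABBA is rejected because AB is not followed by C.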

Biffen

You do not need to use lookarounds, use an optional group:

ABC(?:[BC]*C)?BA

See the regex demo.

Details

  • ABC - an ABC substring
  • (?:[BC]*C)? - an optional non-capturing group matching 0 or more occurrences of B or C chars followed by a C letter
  • BA - a BA substring.

In effect, this matches ABC and then either BA directly or any number of B or C letters ending in C (this streak of chars is optional) followed by BA, so every match starts with ABC and ends with CBA.
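
As a rough check of the breakdown above, the optional-group pattern can be tried in Python's `re` module. The input text below is a made-up example containing only A, B and C.

```python
import re

# ABC, then optionally "any run of B/C ending in C", then BA.
optional = re.compile(r"ABC(?:[BC]*C)?BA")

# Hypothetical input containing one short and one longer occurrence.
text = "AABCBAACABCBCCBAB"

# With no capturing groups in the pattern, findall returns whole matches.
found = optional.findall(text)
print(found)  # ['ABCBA', 'ABCBCCBA']
```

The short ABCBA case is found because the optional group is simply skipped there.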

Note that depending on what you are doing with the pattern, a capturing group will also do, ABC([BC]*C)?BA.
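
One case where the choice of group type matters in Python is `re.findall`, which returns the group contents instead of the whole match when the pattern has a capturing group. The sketch below uses an invented input string to show the difference.

```python
import re

capturing = re.compile(r"ABC([BC]*C)?BA")

# Hypothetical input containing only A, B and C.
text = "AABCBAACABCBCCBAB"

# findall returns the contents of group 1; a group that did not
# participate in the match is reported as an empty string.
groups_only = capturing.findall(text)

# finditer gives match objects, so the full match is still available.
full_matches = [m.group(0) for m in capturing.finditer(text)]

print(groups_only)   # ['', 'BCC']
print(full_matches)  # ['ABCBA', 'ABCBCCBA']
```

If only the whole substrings are needed, the non-capturing form keeps `findall` output simpler.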

Wiktor Stribiżew