It is worth noting that a regex engine stops searching as soon as it finds a match. The order of the alternatives therefore matters in certain situations, because the engine will not go on to check the remaining options in the alternation.
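
A minimal sketch of that behaviour, using toy patterns that are not part of the actual problem: at a given position the alternatives are tried in order, but a match that starts further left wins no matter where its alternative sits in the pattern.

```python
import re

# At the same starting position, alternatives are tried left to right,
# and the first one that matches wins.
first_short = re.sub(r'a|ab', 'X', 'ab')   # 'a' matches first, 'b' is left over
first_long = re.sub(r'ab|a', 'X', 'ab')    # 'ab' matches first

# The scan itself moves left to right through the string, so an
# alternative listed later still wins if it matches at an earlier position.
leftmost = re.search(r'b|a', 'ab').group()

print(first_short, first_long, leftmost)
```

The last line is the relevant case here: putting an alternative first in the pattern does not help if another alternative can match earlier in the string.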

The regex has two purposes:

  1. Delete numbers at the beginning of the string, as long as these numbers are not immediately followed by KINDER, SECONDARY or ELEMENTARY. This is straightforward and can be achieved with the following:
    (^\d*\b )(?!(ELEMENTARY|SECONDARY|KINDER)) 
    
  2. Join the numbers and letters that make up an ordinal number (explained in here). For example, strings such as 10 st become 10st, but strings such as abcdefg238947 th DO NOT change. The corresponding regex is the following:
    (?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))
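
As a rough check, each rule can be run on its own. The sketch below copies the two patterns verbatim; the helper name `apply_rule` is hypothetical, and the `re.IGNORECASE` flag is taken from the calls further down.

```python
import re

# Patterns copied verbatim from the two rules above.
RULE_1 = r'(^\d*\b )(?!(ELEMENTARY|SECONDARY|KINDER))'
RULE_2 = r'(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))'

def apply_rule(pattern, text):
    """Hypothetical helper: delete every match of `pattern` in `text`."""
    return re.sub(pattern, '', text, flags=re.IGNORECASE)

# Rule 1 drops a leading number unless a protected keyword follows it.
rule_1_results = {s: apply_rule(RULE_1, s) for s in ('12345 ABC', '10306 KINDER')}

# Rule 2 removes the whitespace between a digit and an ordinal suffix.
rule_2_results = {s: apply_rule(RULE_2, s) for s in ('10 st', '1 ST KINDER')}

print(rule_1_results)
print(rule_2_results)
```

Applied separately, rule 1 turns `12345 ABC` into `ABC` and leaves `10306 KINDER` alone, while rule 2 turns `1 ST KINDER` into `1ST KINDER`.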
    
The issue arises when combining the two. My understanding is that if I put the second rule first in the alternation, the engine would match it and then continue parsing:
import re

text = re.sub(r'(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))|(^\d*\b )(?!(ELEMENTARY|SECONDARY|KINDER))',
              '',
              '1 ST KINDER',
              flags=re.IGNORECASE)

Given the following string, the engine should join 1 and ST. Then 1ST before KINDER should not match the second rule, but this is not what happens:

1 ST KINDER  --> ST KINDER

More examples:

10306 KINDER  (OK)
12345 ABC     (OK)
1 ST KINDER   (SHOULD BE 1ST KINDER)
1 AB KINDER   (OK)
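
To see which alternative actually fires on `1 ST KINDER`, the matches can be listed with `re.finditer`. This is only a diagnostic sketch; the helper name `trace` is hypothetical.

```python
import re

COMBINED = (
    r'(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))'   # ordinal rule
    r'|(^\d*\b )(?!(ELEMENTARY|SECONDARY|KINDER))'         # leading-number rule
)

def trace(text):
    """Hypothetical helper: list (span, matched text) for every match."""
    return [(m.span(), m.group())
            for m in re.finditer(COMBINED, text, flags=re.IGNORECASE)]

matches = trace('1 ST KINDER')
print(matches)
```

The only match is `'1 '` at span `(0, 2)`. At position 0 the ordinal rule cannot match, because its lookbehind needs a digit before the current position, so the leading-number rule matches there and consumes the very space the ordinal rule would have removed.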

How can I combine both rules in the SAME regex statement using alternation, giving priority to joining numbers and letters when they form an ordinal number, and only then checking for digits at the beginning?

I would like the same behaviour as the following:

text = re.sub(r'^\d+\b(?!\s+(?:ELEMENTARY|SECONDARY|KINDER))',
              '',
              re.sub(r'(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))',
                     '',
                     '1 ST KINDER',
                     flags=re.IGNORECASE),
              flags=re.IGNORECASE)
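
The same two passes can be written as a small function, which may be easier to read and test than the nested call. This is a sketch only: the names `ORDINAL_GAP`, `LEADING_NUMBER` and `clean` are hypothetical, and the patterns are the two used in the nested call.

```python
import re

ORDINAL_GAP = re.compile(
    r'(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))', re.IGNORECASE)
LEADING_NUMBER = re.compile(
    r'^\d+\b(?!\s+(?:ELEMENTARY|SECONDARY|KINDER))', re.IGNORECASE)

def clean(text):
    """Hypothetical helper: join ordinals first, then drop a leading number."""
    joined = ORDINAL_GAP.sub('', text)       # '1 ST KINDER' -> '1ST KINDER'
    return LEADING_NUMBER.sub('', joined)    # '1ST' has no \b after the digit

samples = ['10306 KINDER', '12345 ABC', '1 ST KINDER', '1 AB KINDER']
results = [clean(s) for s in samples]
print(results)
```

Note that the second pattern removes only the digits, so `12345 ABC` comes back as `' ABC'` with its leading space; a trailing `\s*` in the pattern or a `.strip()` on the result would remove it.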
John Barton
  • You miss the fact that matches are searched for from left to right, and the first alternative that matches may grab the text that could be matched with other alternatives. BTW, `(^\d*\b )(?!(ELEMENTARY|SECONDARY|KINDER))` is wrong, you need `^\d+\b(?!\s+(?:ELEMENTARY|SECONDARY|KINDER))\s*` – Wiktor Stribiżew Jan 24 '20 at 21:28
  • Thanks @WiktorStribiżew, I was expecting `1 ST` to match the first element in the alternation which is `(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))` and transform the string to `1ST`. How is this wrong? – John Barton Jan 24 '20 at 21:36
  • That is not wrong. I think you need to fail every match of a number at the start of the string when it is followed by an ordinal suffix. Try `(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)\b)|^\d+\b(?!\s+(?:ST|[RN]D|TH)\b)(?!\s+(?:ELEMENTARY|SECONDARY|KINDER))\s*`, see [this demo](https://regex101.com/r/8p1tOU/1) – Wiktor Stribiżew Jan 24 '20 at 21:41
  • Not sure what's wrong with the two-substitution method; if this needed to be super-fast tight code you probably wouldn't be writing it in Python, so why not use the easier-to-understand solution? You're fundamentally trying to do two different things - on the one hand, remove numbers, and on the other, leave the numbers but remove a space. Merging those into a single operation is not going to be straightforward, even with the power of regexes at your disposal. – Mark Reed Jan 24 '20 at 21:51
  • @WiktorStribiżew I checked the solution you provided, but it is failing with the string `1 ST ABC`, the `1ST` should disappear as `12345 ABC` is removed. – John Barton Jan 24 '20 at 22:00
  • I do not quite understand your rules. Right now, the rules are for two regex replacements. Define the ones for a combined pattern. – Wiktor Stribiżew Jan 24 '20 at 22:10
  • You are right, they should be separate regex replacements. The combined pattern is that cardinal and ordinal numbers should be removed if they are not followed by `ELEMENTARY, SECONDARY, KINDER`. If they are followed by one of those, then the numbers are kept. – John Barton Jan 24 '20 at 22:15
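
As a quick check of the single pattern suggested in the comments, the sketch below runs it over the sample strings from the question plus `1 ST ABC`. The pattern is copied from the comment; using `re.IGNORECASE` is an assumption carried over from the question's code.

```python
import re

# Pattern copied from the comment thread; the flag is an assumption.
SUGGESTED = re.compile(
    r'(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)\b)'
    r'|^\d+\b(?!\s+(?:ST|[RN]D|TH)\b)(?!\s+(?:ELEMENTARY|SECONDARY|KINDER))\s*',
    re.IGNORECASE,
)

samples = ['10306 KINDER', '12345 ABC', '1 ST KINDER', '1 AB KINDER', '1 ST ABC']
results = {s: SUGGESTED.sub('', s) for s in samples}

for original, cleaned in results.items():
    print(repr(original), '->', repr(cleaned))
```

This gives `1ST KINDER` for `1 ST KINDER`, as wanted, but `1 ST ABC` becomes `1ST ABC` rather than being stripped, which is the case objected to in the comments.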
