Note that the regex engine stops searching as soon as it finds a match. The order of the alternatives therefore matters in some situations, because once one alternative matches, the engine does not check the remaining options in the alternation.
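As a minimal illustration of that ordering effect (the toy patterns `a|ab` and `ab|a` are chosen only for this sketch and are not part of my real regex), Python's `re` module behaves like this when both alternatives can match at the same position:

```python
import re

# Both alternatives can match at position 0, so the one listed first wins.
short_first = re.search(r"a|ab", "ab").group()
long_first = re.search(r"ab|a", "ab").group()

print(short_first)  # a
print(long_first)   # ab
```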
The regex I am writing has two parts:
- Delete numbers at the beginning of the string, as long as these
numbers are not immediately before KINDER, SECONDARY or ELEMENTARY.
This is straightforward and can be achieved with the following:
(^\d*\b )(?!(ELEMENTARY|SECONDARY|KINDER))
- Join the number and the letters that make up an ordinal number
(explained here). For example, strings such as 10 st become 10st,
but strings such as abcdefg238947 th DO NOT change. The corresponding regex is the following:
(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))
The issue comes when combining the two. My understanding was that if I put the ordinal rule first in the alternation, the engine would match it and then continue parsing:
import re

text = re.sub(
    r'(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))|(^\d*\b )(?!(ELEMENTARY|SECONDARY|KINDER))',
    '',
    '1 ST KINDER',
    flags=re.IGNORECASE)
Given the following string, the engine should join 1 and ST. Then, with 1ST before KINDER, the second rule should not match, but this is not the case:
1 ST KINDER --> ST KINDER
More examples:
10306 KINDER (OK)
12345 ABC (OK)
1 ST KINDER (SHOULD BE 1ST KINDER)
1 AB KINDER (OK)
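To make the behaviour above reproducible, here is a small sketch that runs the combined pattern over the four sample strings and lists what each match consumed. The names `COMBINED`, `show_matches`, `samples` and `results` are only illustrative. As far as I can tell, at position 0 the ordinal branch cannot apply because its lookbehind needs a preceding digit, so the leading-digits branch matches there first and consumes the space.

```python
import re

# The combined pattern from above, split over two lines for readability.
COMBINED = re.compile(
    r"(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))"
    r"|(^\d*\b )(?!(ELEMENTARY|SECONDARY|KINDER))",
    re.IGNORECASE,
)

def show_matches(text):
    """Return (span, matched_text) for every match of the combined pattern."""
    return [(m.span(), m.group()) for m in COMBINED.finditer(text)]

samples = ["10306 KINDER", "12345 ABC", "1 ST KINDER", "1 AB KINDER"]
results = {s: COMBINED.sub("", s) for s in samples}

for s in samples:
    print(f"{s!r} -> {results[s]!r}  matches={show_matches(s)}")

# For '1 ST KINDER' the only match is '1 ' at span (0, 2),
# which leaves 'ST KINDER'.
```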
How can I combine both rules in the SAME regex statement using alternation, giving priority to joining the number and letters of an ordinal number, and only then checking for digits at the beginning?
I would like the same behaviour as the following:
text = re.sub(
    r'^\d+\b(?!\s+(?:ELEMENTARY|SECONDARY|KINDER))',
    '',
    re.sub(
        r'(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))',
        '',
        '1 ST KINDER',
        flags=re.IGNORECASE),
    flags=re.IGNORECASE)
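For reference, the same two-pass behaviour can be written as a small helper, which makes it easier to check against the sample strings. This is only a sketch of the nested call above, and the names `ORDINAL_GAP`, `LEADING_NUMBER` and `clean` are made up for the example.

```python
import re

# Pass 1: whitespace between a digit and an ordinal suffix (ST, ND, RD, TH).
ORDINAL_GAP = re.compile(
    r"(?<=[0-9])\s+(?=(?:ST|[RN]D|TH)(?: +[^\W\d_]|$))", re.IGNORECASE
)
# Pass 2: leading digits not followed by one of the school keywords.
LEADING_NUMBER = re.compile(
    r"^\d+\b(?!\s+(?:ELEMENTARY|SECONDARY|KINDER))", re.IGNORECASE
)

def clean(text):
    """Join ordinals first, then strip a leading standalone number."""
    joined = ORDINAL_GAP.sub("", text)
    return LEADING_NUMBER.sub("", joined)

print(clean("1 ST KINDER"))   # 1ST KINDER
print(clean("10306 KINDER"))  # 10306 KINDER
```

Note that, unlike the first rule shown earlier, this second-pass pattern does not consume the space after the number, so `clean("12345 ABC")` returns `" ABC"` with a leading space.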