I have a list of substitutions that I need to keep intact. For instance, the substitution list: ['1st', '2nd', '10th', '100th', '1st nation', 'xlr8', '5pin', 'h20'].
In general, numbers and letters in alphanumeric strings need to be split apart as follows:
text = re.sub(r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)', ' ', text, 0, re.IGNORECASE)
This pattern successfully separates all numbers from letters by inserting a space between them, as in the following examples:
Original Regex
ABC10 DEF --> ABC 10 DEF
ABC DEF10 --> ABC DEF 10
ABC 10DEF --> ABC 10 DEF
10ABC DEF --> 10 ABC DEF
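For reference, the behaviour above can be reproduced with a short script. This is only a minimal sketch: the names `BOUNDARY` and `split_alnum` are hypothetical helpers introduced for illustration, and the flags are omitted because the pattern contains no letters for `re.IGNORECASE` to affect.

```python
import re

# Zero-width pattern from the question: matches the position between
# a digit and a non-digit, non-space character (in either order).
BOUNDARY = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'

def split_alnum(text):
    # Hypothetical helper name; inserts a space at every such position.
    return re.sub(BOUNDARY, ' ', text)

for sample in ['ABC10 DEF', 'ABC DEF10', 'ABC 10DEF', '10ABC DEF']:
    print(sample, '-->', split_alnum(sample))
```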
However, some alphanumeric words are part of the substitution list and must not be separated. For instance, in the following strings, words such as 1ST are part of the substitution list, so they should be skipped rather than split with a space:
Original --> Regex --> Expected
1ST DEF 100CD --> 1 ST DEF 100 CD --> 1ST DEF 100 CD
ABC 1ST 100CD --> ABC 1 ST 100 CD --> ABC 1ST 100 CD
100TH DEF 100CD --> 100 TH DEF 100 CD --> 100TH DEF 100 CD
10TH DEF 100CD --> 10 TH DEF 100 CD --> 10TH DEF 100 CD
To get the Expected column in the example above, I tried to use an if-then-else conditional in the regex, but I am getting a syntax error in Python:
(?(?=condition)(then1|then2|then3)|(else1|else2|else3))
Based on the syntax, I should have something like the following:
?(?!1ST)((?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)))
where (?!...) would include the substitutions to skip when matching the regex pattern, in this case the words 1ST, 10TH and 100TH.
How can I avoid matching word substitutions in the string?
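One possible direction, offered as a sketch rather than a definitive answer: as far as I can tell, Python's `re` module only supports conditionals of the form (?(id/name)yes|no), which test whether a group matched, so a lookahead cannot be used as the condition. An alternative that avoids conditionals is to put the protected words first in an alternation and return them unchanged from a replacement function. The names `KEEP`, `PATTERN` and `split_keep` below are hypothetical, and the sketch assumes that the protected words always appear as whole words (delimited by `\b`) and that Python 3.7 or later is used.

```python
import re

KEEP = ['1st', '2nd', '10th', '100th', '1st nation', 'xlr8', '5pin', 'h20']

# Longest entries first, so '1st nation' is tried before '1st'.
protected = '|'.join(re.escape(w) for w in sorted(KEEP, key=len, reverse=True))

PATTERN = re.compile(
    rf'\b(?:{protected})\b'                     # a whole protected word
    r'|(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)',  # or a digit/letter boundary
    re.IGNORECASE,
)

def split_keep(text):
    # A protected word yields a non-empty match and is returned as-is;
    # a boundary yields an empty match and is replaced by a space.
    return PATTERN.sub(lambda m: m.group(0) or ' ', text)

for sample in ['1ST DEF 100CD', 'ABC 1ST 100CD',
               '100TH DEF 100CD', '10TH DEF 100CD']:
    print(sample, '-->', split_keep(sample))
```

Because the protected alternative consumes the whole word, the boundary alternatives never get a chance to match inside it. A protected word glued to other letters or digits (for example ABC1ST) is not covered by this sketch and would still be split.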