Regex in Python: Separate words from numbers JUST when not in list

Question

I have a list containing some substitutions which I need to keep. For instance, the substitution list: ['1st', '2nd', '10th', '100th', '1st nation', 'xlr8', '5pin', 'h20'].

In general, numbers and letters inside alphanumeric strings need to be separated, as follows:

text = re.sub(r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)', ' ', text, flags=re.IGNORECASE)

This pattern successfully separates numbers from letters by inserting a space between them, as shown here:

Original       Regex
ABC10 DEF  --> ABC 10 DEF
ABC DEF10  --> ABC DEF 10
ABC 10DEF  --> ABC 10 DEF
10ABC DEF  --> 10 ABC DEF
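
For reference, the table above can be reproduced with a minimal sketch like the following. The helper name `split_alnum` is only illustrative, and the IGNORECASE flag is left out on the assumption that it has no effect here, since the pattern contains no letters.

```python
import re

# Zero-width pattern: matches the empty position between a digit and a
# non-digit, non-space character (in either order).
SPLIT_RX = re.compile(r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)')

def split_alnum(text):
    """Insert a space at every digit/letter boundary (illustrative helper)."""
    return SPLIT_RX.sub(' ', text)

for sample in ['ABC10 DEF', 'ABC DEF10', 'ABC 10DEF', '10ABC DEF']:
    print(f'{sample:<10} --> {split_alnum(sample)}')
```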

However, some alphanumeric words are part of the substitution list and must not be separated. For instance, in the following strings 1ST is part of the substitution list, so it should be left intact instead of having a space inserted:

Original             Regex                   Expected
1ST DEF 100CD    --> 1 ST DEF 100 CD     --> 1ST DEF 100 CD
ABC 1ST 100CD    --> ABC 1 ST 100 CD     --> ABC 1ST 100 CD
100TH DEF 100CD  --> 100 TH DEF 100 CD   --> 100TH DEF 100 CD
10TH DEF 100CD   --> 10 TH DEF 100 CD    --> 10TH DEF 100 CD

To get the Expected column above, I tried to use an if-then-else (conditional) construct in the regex, but I am getting a syntax error in Python:

(?(?=condition)(then1|then2|then3)|(else1|else2|else3))

Based on the syntax, I should have something like the following:

?(?!1ST)((?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)))

where (?!...) would include the possible substitutions to avoid when matching the regex pattern, in this case the words 1ST 10TH 100TH.

How can I avoid matching word substitutions in the string?

You misunderstand the word "conditional". The construct won't help. You may use negative lookaheads to restrict digit checking, like `(?<=(?!1ST\b)\d)(?=[^\d\s])|(?<=[^\d\s])(?=(?!1ST\b)\d)`, see [demo](https://regex101.com/r/rvn4NT/1) — Wiktor Stribiżew, Jan 15 '20 at 00:12
Another idea: `re.sub(r'\s*(?!(?<!\d)1ST\b)(\d+)\s*', r' \1 ', text).strip()` — Wiktor Stribiżew, Jan 15 '20 at 00:22
Thanks @WiktorStribiżew. It worked like a charm. Can you please recommend some regex books for learning lookarounds in detail? I am still unsure how the regex is processed step by step. — John Barton, Jan 15 '20 at 00:43
Do other answers help you or shall I post any of the solutions above with explanation? — Wiktor Stribiżew, Jan 15 '20 at 08:05
Thanks @WiktorStribiżew, your solution worked, but I am confused by the negative lookahead nested inside the lookbehind. I am trying to figure out, step by step, how the regex adds a space when the token is not `1ST` and adds none when it is `1ST`. — John Barton, Jan 15 '20 at 22:14
Sorry, you cannot use spaces freely in the comments, could you please add that to the question? I understand `re.sub(r'\s*(?!(?<!\d)1ST\b)(\d+)\s*', r' \1 ', text).strip()` works, right? — Wiktor Stribiżew, Jan 15 '20 at 23:39
Why does a similar regex, `(?<=(?!12TH\b)\d{1,2})(?=[^\d\s])|(?<=[^\d\s])(?=(?!12TH\b)\d{1,2})`, not work for the string `WEST 12TH APARTMENT`? I am getting the message `{1,2} A quantifier inside a lookbehind makes it non-fixed width`. How can I make the regex work for cases such as 1ST, 10TH, 100TH, etc.? — John Barton, Jan 16 '20 at 01:10
So, the `python-regex` tag is not relevant? I thought you knew it means you are using the [PyPi regex module](https://pypi.org/project/regex/). `re` does not support unknown length matching quantifiers in lookbehinds. — Wiktor Stribiżew, Jan 16 '20 at 07:51
I posted a full Python `re` based solution [below](https://stackoverflow.com/a/59765443/3832970) — Wiktor Stribiżew, Jan 16 '20 at 08:20
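
To make the nested lookahead from the first comment easier to trace, here is a small sketch that builds that pattern for a single exception word. The helper `build_guarded_rx` is hypothetical (it does not appear in the thread) and only wraps the commented pattern so it can be tried with different words. It also suggests why the approach does not extend to `10TH` or `12TH` with the `re` module: the lookbehind steps back exactly one character, so the guard is only ever tested at the last digit of the number.

```python
import re

def build_guarded_rx(exception):
    # Hypothetical helper: the pattern from the comment, parameterised on one
    # exception word such as '1ST'.
    guard = rf'(?!{re.escape(exception)}\b)'
    return re.compile(
        rf'(?<={guard}\d)(?=[^\d\s])'    # after a digit, unless the exception starts at that digit
        rf'|(?<=[^\d\s])(?={guard}\d)'   # before a digit, unless the exception starts at that digit
    )

rx_1st = build_guarded_rx('1ST')
print(rx_1st.sub(' ', '1ST DEF 100CD'))   # 1ST DEF 100 CD
print(rx_1st.sub(' ', 'ABC 1ST 100CD'))   # ABC 1ST 100 CD

# With a two-digit exception the guard is checked at '0' (one character back
# from the 0|T boundary), where '10TH' cannot match, so the word is still split.
rx_10th = build_guarded_rx('10TH')
print(rx_10th.sub(' ', '10TH DEF'))       # 10 TH DEF
```

This is one reason a solution built on an alternation with the whole exception word tends to be easier to generalise than one built on lookarounds.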

Nick · Answer 1 · 2020-01-15T05:25:54.497

You can do this with a replacement callback that checks whether the matched string is in your exclusion list:

import re

subs = ['1st','2nd','1st nation','xlr8','5pin','h20']
text = """
ABC10 DEF
1ST DEF 100CD
ABC 1ST 100CD
AN XLR8 45X
NO H20 DEF
A4B PLUS
"""

def add_spaces(m):
    if m.group().lower() in subs:
        return m.group()
    res = m.group(1)
    if len(res):
        res += ' '
    res += m.group(2)
    if len(m.group(3)):
        res += ' '
    res += m.group(3)
    return res

text = re.sub(r'\b([^\d\s]*)(\d+)([^\d\s]*)\b', lambda m: add_spaces(m), text)
print(text)

Output:

ABC 10 DEF
1ST DEF 100 CD
ABC 1ST 100 CD
AN XLR8 45 X
NO H20 DEF
A 4 B PLUS

You can simplify the callback function to

def add_spaces(m):
    if m.group().lower() in subs:
        return m.group()
    return m.group(1) + ' ' + m.group(2) + ' ' + m.group(3)

but this might result in extra whitespace in the output string. That could then be removed with

text = re.sub(r' +', ' ', text)
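
Put together, the simplified callback and the whitespace cleanup might look like the sketch below. The wrapper name `split_tokens`, the use of a set for the exclusion list, and the final `.strip()` are assumptions added for illustration; they are not part of the answer above.

```python
import re

# Exclusion list as a set for quick membership tests (assumption: lower-case entries).
subs = {'1st', '2nd', '1st nation', 'xlr8', '5pin', 'h20'}

def add_spaces(m):
    # Keep excluded tokens untouched; otherwise space out prefix, digits, suffix.
    if m.group().lower() in subs:
        return m.group()
    return m.group(1) + ' ' + m.group(2) + ' ' + m.group(3)

def split_tokens(text):
    # Illustrative wrapper: split every token containing digits, then tidy spaces.
    spaced = re.sub(r'\b([^\d\s]*)(\d+)([^\d\s]*)\b', add_spaces, text)
    return re.sub(r' +', ' ', spaced).strip()

for line in ['ABC10 DEF', '1ST DEF 100CD', 'AN XLR8 45X', 'NO H20 DEF', 'A4B PLUS']:
    print(split_tokens(line))
```

Note that the function can be passed to `re.sub` directly; wrapping it in a lambda is not required.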

score 2 · Answer 2 · answered Jan 15 '20 at 06:21

Another way, using the PyPI regex module (which supports the (*SKIP)(*FAIL) verbs) and f-strings:

import regex as re

lst = ['1st','2nd','1st nation','xlr8','5pin','h20']

data = """
ABC10 DEF
ABC DEF10
ABC 10DEF
10ABC DEF
1ST DEF 100CD
ABC 1ST 100CD"""

rx = re.compile(
    rf"""
    (?:{"|".join(item.upper() for item in lst)})(*SKIP)(*FAIL)
    |
    (?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)
    """, re.X)

data = rx.sub(' ', data)
print(data)

This yields

ABC 10 DEF
ABC DEF 10
ABC 10 DEF
10 ABC DEF
1ST DEF 100 CD
ABC 1ST 100 CD

score 1 · Accepted Answer · answered Jan 16 '20 at 08:19

When you deal with exceptions, the easiest and safest way is to use the "best trick ever" approach. When replacing, this trick means: keep what is captured and remove (or replace) what is merely matched, or vice versa. In regex terms, you use an alternation and put a capturing group around one of the alternatives (or around several, in complex scenarios) so that you can inspect the match structure once a match is found.

So, at first, use the exception list to build the first part of the alternation:

exception_rx = "|".join(map(re.escape, exceptions))

Note re.escape adds backslashes where needed to support any special characters in the exceptions. If your exceptions are all alphanumeric, you do not need that and you can just use exception_rx = "|".join(exceptions). Or even exception_rx = rf'\b(?:{"|".join(exceptions)})\b' to only match them as whole words.
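
A quick sketch of how the two variants differ in practice. The names `loose_rx` and `strict_rx` are illustrative, and sorting the exceptions longest-first is an extra assumption (not in the answer) so that `1st nation` is tried before `1st`.

```python
import re

exceptions = ['1st', '2nd', '10th', '100th', '1st nation', 'xlr8', '5pin', 'h20']

# Assumption: try longer exceptions first so '1st nation' wins over '1st'.
ordered = sorted(exceptions, key=len, reverse=True)

loose_rx = "|".join(map(re.escape, ordered))   # matches exceptions anywhere
strict_rx = rf'\b(?:{loose_rx})\b'             # matches exceptions as whole words only

sample = '21ST 1ST NATION'
print(re.findall(loose_rx, sample, re.I))    # ['1ST', '1ST NATION']
print(re.findall(strict_rx, sample, re.I))   # ['1ST NATION']
```

Without the word boundaries, the `1ST` inside `21ST` counts as an exception, so `21ST` would be left unsplit; with them, only standalone exceptions are protected.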

Next, you need the pattern that will find all matches regardless of context, the one I already posted:

generic_rx = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'

Finally, join them using the (exception_rx)|generic_rx scheme:

rx = re.compile(rf'({exception_rx})|{generic_rx}', re.I)

and replace using .sub():

s = rx.sub(lambda x: x.group(1) or " ", s)

Here, lambda x: x.group(1) or " " means return Group 1 value if Group 1 matched, else, replace with a space.
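
If the lambda is hard to read, the same replacement can be written as a named function. This is only an equivalent sketch: the name `keep_or_space` and the shortened exception list are illustrative. The point is that the argument passed to the callback is a match object (one per match found), not the input string.

```python
import re

exception_rx = '|'.join(map(re.escape, ['1st', '10th']))   # shortened list for the example
generic_rx = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'
rx = re.compile(rf'({exception_rx})|{generic_rx}', re.I)

def keep_or_space(match):
    # `match` is a re.Match object; group(1) is None when the generic
    # (zero-width) alternative matched instead of an exception.
    if match.group(1) is not None:
        return match.group(1)   # exception found: put it back unchanged
    return " "                  # digit/letter boundary: insert a space

result = rx.sub(keep_or_space, '10TH DEF 100CD')
print(result)   # 10TH DEF 100 CD
```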

See the Python demo:

import re

exceptions = ['1st','2nd','10th','100th','1st nation','xlr8','5pin','h20', '12th'] # '12th' added
exception_rx = '|'.join(map(re.escape, exceptions))
generic_rx = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'
rx = re.compile(rf'({exception_rx})|{generic_rx}', re.I)

string_lst = ['1ST DEF 100CD','ABC 1ST 100CD','WEST 12TH APARTMENT']
for s in string_lst:
    print(rx.sub(lambda x: x.group(1) or " ", s))

Output:

1ST DEF 100 CD
ABC 1ST 100 CD
WEST 12TH APARTMENT

In the following expression `rx.sub(lambda x: x.group(1) or " ",s)`, x in lambda is getting the string for each iteration such as `1ST DEF 100CD`. However, if I try directly s.group(1) I am getting `str object has no attribute group`. What is this x in lambda if it is not the string coming from string_lst? — John Barton, Jan 16 '20 at 18:49
Why can't I get the same result with `text = re.sub(r'({exception_rx})|{generic_rx}', r'\1' or ' ', s)`? — John Barton, Jan 16 '20 at 19:16
@Juan You cannot use a string replacement pattern because you need one replacement if Group 1 matches and another one if it does not. — Wiktor Stribiżew, Jan 16 '20 at 19:26
I have reused the regex patterns in Postgresql. How can I get the same result without lambda? This is related to the first comment. Thanks a lot. — John Barton, Jan 16 '20 at 19:46
@JuanPerez You can't do that in PostgreSQL, those functions do not support callback as replacement argument. — Wiktor Stribiżew, Jan 16 '20 at 21:26
