How to normalize text with regex?

Question

How can I normalize text with a regex that applies some conditions?

Suppose we have a string like this: One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1

And I want to normalize it like this: one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1

Remove all dots and commas, except those between digits.
Split numbers from letters unless the token starts with the letter 'M': T933 --> T 933
Convert everything to lowercase.
Do not split if there is a dot or comma between digits: 35.4 --> 35.4 or 9,3 --> 9.3 (if there is a comma between digits, replace it with a dot).

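What I have so far is this:

import re

def process(text, **kwargs):
    # Treat a decimal comma like a decimal point
    text = text.replace(',', '.')
    # Split on numbers; the capturing group keeps the numbers in the result
    parts = re.split(r'(-?\d*\.?\d+)', text)
    text = ' '.join(parts)
    # str.lower() returns a new string, so the result must be assigned
    text = text.lower()
    return text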

But there is no condition for numbers that start with the letter 'M', so they are split as well. Also, for some reason, I get some unnecessary spaces after processing the string.
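The stray spaces can be traced to how re.split and ' '.join interact. A minimal sketch, assuming the same split pattern as in the attempt above (the sample string and variable names are only illustrative):

```python
import re

# re.split with a capturing group keeps the matched numbers, but the
# neighbouring pieces keep their own whitespace, and a match at the very
# end of the string produces a trailing empty piece.
parts = re.split(r'(-?\d*\.?\d+)', 'five 8.5')
print(parts)    # ['five ', '8.5', '']

# Joining with ' ' then adds a space next to the existing one and
# another one before the empty trailing piece.
joined = ' '.join(parts)
print(repr(joined))    # 'five  8.5 '

# Splitting on whitespace and re-joining collapses the extra spaces.
collapsed = ' '.join(joined.split())
print(repr(collapsed))    # 'five 8.5'
```

Collapsing whitespace afterwards hides the symptom, but it does not solve the 'M' condition, which needs a different pattern.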

Are there any ideas on how to do this with regex? Or with helper methods like replace, lower, join, and so on?

Does `re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)', ' ', text)).lower()` work as expected? — Wiktor Stribiżew, Jul 26 '22 at 13:10
Looks good, but if there is something like `aa88aa`, what I need is `aa 88 aa`, and this command only splits at the first letter-to-digit boundary. What do I need to modify to split all numbers and letters within one substring (word), e.g. `aa99bb88cc77` --> `aa 99 bb 88 cc 77`? — Dmiich, Jul 26 '22 at 13:17
So, `re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()`? — Wiktor Stribiżew, Jul 26 '22 at 13:23

score 2 · Accepted Answer · answered Jul 26 '22 at 13:29

I can suggest a solution like

re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()

The outer re.sub is meant to remove dots or commas when not between digits:

[.,] - a comma or dot
(?!(?<=\d.)\d) - a negative lookahead that fails the match if the next character is a digit and the two characters before that digit are a digit plus any one char (here, the dot or comma that was just matched)
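To see the outer pattern in isolation, here is a small sketch that applies it to a few short strings. The constant name and the samples are only illustrative; the pattern itself is the one described above.

```python
import re

# Same pattern as the outer re.sub; the name is hypothetical.
PUNCT_NOT_BETWEEN_DIGITS = re.compile(r'[.,](?!(?<=\d.)\d)')

samples = ['two,', '35.4.', '9,3', 'a.5']
cleaned = [PUNCT_NOT_BETWEEN_DIGITS.sub('', s) for s in samples]
print(cleaned)    # ['two', '35.4', '9,3', 'a5']
```

Note that the dot in 'a.5' is removed because it has a digit only on its right side, not on both sides.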

The inner re.sub replaces with a space the following pattern:

(?<=[^\W\d_])(?<![MmXx])(?=\d) - a location between a letter ([^\W\d_] matches any letter) and a digit (see (?=\d)), where the letter is not M or X (case insensitive, [MmXx] can be written as (?i:[mx]))
| - or
(?<=\d)(?=[^\W\d_]) - a location between a digit and a letter.
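The inner pattern can be tried on its own in the same way. This is a sketch with illustrative names and samples; it only shows where the zero-width boundary matches and gets replaced with a space.

```python
import re

LETTER = r'[^\W\d_]'    # any Unicode letter

# letter -> digit boundary (unless the letter is M or X), or digit -> letter boundary
boundary = re.compile(rf'(?<={LETTER})(?<![MmXx])(?=\d)|(?<=\d)(?={LETTER})')

samples = ['T933', 'three35.4', 'M2x13', 'aa99bb88cc77']
split_up = [boundary.sub(' ', s) for s in samples]
print(split_up)    # ['T 933', 'three 35.4', 'M2 x13', 'aa 99 bb 88 cc 77']
```

The 'M2x13' sample shows that the second alternative still splits between a digit and an 'x', since the M/X exclusion only applies to the letter-to-digit direction.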

See the Python demo:

import re
text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
print( re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower() )

Output:

one t 933 two three 35.4 four 9,3 8.5 five m2 x13 m4.3 x2.1 aa 88 aa
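The output above contains m2 x13 and m4.3 x2.1, while the question asked for m2x13 and m4.3x2.1. One possible adjustment, shown only as a sketch: skip the digit-to-letter split when the letter is an x that is itself followed by a digit. This assumes that an x between numbers is always a dimension separator. The function name normalize and the comma_to_dot flag are hypothetical additions, not part of the original answer.

```python
import re

LETTER = r'[^\W\d_]'

# letter -> digit boundary (unless the letter is M or X), or
# digit -> letter boundary (unless that letter is an x followed by a digit)
SPLIT = re.compile(
    rf'(?<={LETTER})(?<![MmXx])(?=\d)'
    rf'|(?<=\d)(?![Xx]\d)(?={LETTER})'
)
STRAY_PUNCT = re.compile(r'[.,](?!(?<=\d.)\d)')    # dot/comma not between digits
DECIMAL_COMMA = re.compile(r'(?<=\d),(?=\d)')      # comma between two digits

def normalize(text, comma_to_dot=False):
    text = SPLIT.sub(' ', text)
    text = STRAY_PUNCT.sub('', text)
    if comma_to_dot:
        # Optional: the question mentions both 9,3 and 9.3 as the target form
        text = DECIMAL_COMMA.sub('.', text)
    return text.lower()

text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
result = normalize(text)
print(result)
# one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1 aa 88 aa
```

Calling normalize(text, comma_to_dot=True) would additionally turn 9,3 into 9.3.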
