How to normalize text with regex and some conditional rules?

Suppose we have a string like this: One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1

And I want to normalize it like this: one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1

  1. Remove all dots and commas (except those between digits, see rule 4).
  2. Split numbers from letters unless the word starts with the letter 'M': T933 --> T 933
  3. Convert everything to lowercase.
  4. Do not split when there is a dot or comma between digits: 35.4 --> 35.4, and 9,3 --> 9.3 (if there is a comma between digits, replace it with a dot).

This is what I have so far:

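import re

def process(text, **kwargs):
    # Treat a decimal comma as a decimal dot.
    text = text.replace(',', '.')
    # The capturing group keeps the matched numbers in the split result.
    parts = re.split(r'(-?\d*\.?\d+)', text)
    text = ' '.join(parts)
    # str.lower() returns a new string, so return its result.
    return text.lower()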

However, there is no condition for numbers that start with the letter 'M', so they are split as well. Also, for some reason, I get some unnecessary spaces after processing the string.
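
The extra spaces can be reproduced in isolation. The sketch below is only an illustration (the sample string and variable names are made up for it): re.split with a capturing group keeps the whitespace that surrounds each number and can also yield an empty trailing item, so joining with ' ' doubles the spaces.

```python
import re

sample = 'T933 two 8.5'  # made-up sample, shorter than the real input

# The capturing group keeps each number as its own list item; the text
# between the numbers keeps its original spaces.
parts = re.split(r'(-?\d*\.?\d+)', sample)
print(parts)  # ['T', '933', ' two ', '8.5', '']

# Joining with ' ' adds a space next to the spaces that are already there.
joined = ' '.join(parts)
print(repr(joined))  # 'T 933  two  8.5 '

# One possible cleanup: split on any whitespace and re-join.
collapsed = ' '.join(joined.split())
print(repr(collapsed))  # 'T 933 two 8.5'
```

Collapsing whitespace this way assumes that runs of spaces carry no meaning in the input.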

Are there any ideas on how to do this with regex? Or with helper methods like replace, lower, join and so on?

Dmiich
  • Does `re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)', ' ', text)).lower()` work as expected? – Wiktor Stribiżew Jul 26 '22 at 13:10
  • Looks good, but if there is something like `aa88aa`, what I need is `aa 88 aa`, and this command only splits the first run of letters and digits. What do I need to modify to split all numbers and letters within one substring (word), e.g. `aa99bb88cc77` --> `aa 99 bb 88 cc 77`? – Dmiich Jul 26 '22 at 13:17
  • So, `re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()`? – Wiktor Stribiżew Jul 26 '22 at 13:23

1 Answer

I can suggest a solution like

re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower()

The outer re.sub is meant to remove dots or commas when not between digits:

  • [.,] - a comma or dot
  • (?!(?<=\d.)\d) - a negative lookahead that fails the match if there is a digit immediately to the right that is itself immediately preceded by a digit + any one char (here, the dot or comma that was just matched)
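
To see the outer pattern on its own, here is a small sketch that applies it to a few isolated samples. The sample strings and the name `punct` are only illustrative.

```python
import re

# Pattern from the answer: a dot or comma that is NOT followed by a digit
# which itself sits right after "digit + one char".
punct = re.compile(r'[.,](?!(?<=\d.)\d)')

samples = ['two,', '35.4.', '9,3', 'a.5']
results = [punct.sub('', s) for s in samples]

for s, r in zip(samples, results):
    print(repr(s), '->', repr(r))
# 'two,'  -> 'two'    (comma not followed by a digit: removed)
# '35.4.' -> '35.4'   (inner dot kept, trailing dot removed)
# '9,3'   -> '9,3'    (comma between digits kept)
# 'a.5'   -> 'a5'     (no digit before the dot: removed)
```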

The inner re.sub replaces with a space the following pattern:

  • (?<=[^\W\d_])(?<![MmXx])(?=\d) - a location between a letter ([^\W\d_] matches any letter) and a digit (see (?=\d)), where the letter is not M or X (case insensitive, [MmXx] can be written as (?i:[mx]))
  • | - or
  • (?<=\d)(?=[^\W\d_]) - a location between a digit and a letter.
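
The inner pattern can be tried separately in the same way. This is a sketch with illustrative samples; `boundary` is just a name chosen here.

```python
import re

# Pattern from the answer: the empty position between a letter (not M/X)
# and a digit, or between a digit and any letter.
boundary = re.compile(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])')

samples = ['T933', 'three35.4', 'M2x13', 'aa99bb88cc77']
results = [boundary.sub(' ', s) for s in samples]

for s, r in zip(samples, results):
    print(s, '->', r)
# T933         -> T 933
# three35.4    -> three 35.4
# M2x13        -> M2 x13   (the digit-to-letter alternative still splits before x)
# aa99bb88cc77 -> aa 99 bb 88 cc 77
```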

See the Python demo:

import re
text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
print( re.sub(r'[.,](?!(?<=\d.)\d)', '', re.sub(r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?=[^\W\d_])', ' ', text)).lower() )

Output:

one t 933 two three 35.4 four 9,3 8.5 five m2 x13 m4.3 x2.1 aa 88 aa
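
Note that this output contains `m2 x13` and `m4.3 x2.1`, while the question asks for `m2x13` and `m4.3x2.1`: the digit-to-letter alternative also matches right before `x`. A possible variant, which is not part of the original answer and only a sketch assuming that a digit followed by x/X should never be split, adds a `(?![Xx])` check to that alternative. The function name `normalize` is hypothetical.

```python
import re

def normalize(text):
    # Assumption: a digit followed by x/X is never split (e.g. 2x13),
    # hence the extra (?![Xx]) in the second alternative.
    spaced = re.sub(
        r'(?<=[^\W\d_])(?<![MmXx])(?=\d)|(?<=\d)(?![Xx])(?=[^\W\d_])',
        ' ',
        text,
    )
    # Same outer step as in the answer: drop dots/commas not between digits.
    return re.sub(r'[.,](?!(?<=\d.)\d)', '', spaced).lower()

text = 'One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 aa88aa'
print(normalize(text))
# one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1 aa 88 aa
```

The trade-off is that something like `5xyz` would also stay unsplit, so this only fits if `x` after a digit always means a dimension separator.
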
Wiktor Stribiżew