I have this multiline variable:

raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
        , TEST.RAW_2
        , TEST.RAW_3
        , TEST.RAW_4
PARALLEL = 4
'''

The structure is always TAG = CONTENT. Neither string is fixed, and CONTENT may contain newlines.

I need a regex to get:

[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n        , TEST.RAW_2\n        , TEST.RAW_3\n        , TEST.RAW_4\n'), ('PARALLEL', '4')]

I've tried multiple combinations, but I can't get the regex engine to stop at the right point for the TABLES tag, because its content is a multiline string delimited only by the next tag.

Some attempts from the interpreter:

>>> re.findall(r'(\w+?)\s=\s(.+?)', raw, re.DOTALL)
[('CONTENT', 'A'), ('TABLES', 'T'), ('PARALLEL', '4')]


>>> re.findall(r'^(\w+)\s=\s(.+)?', raw, re.M)
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1'), ('PARALLEL', '4')]


>>> re.findall(r'(\w+)\s=\s(.+)?', raw, re.DOTALL)
[('CONTENT', 'ALL\nTABLES = TEST.RAW_1\n        , TEST.RAW_2\n        , TEST.RAW_3\n        , TEST.RAW_4\nPARALLEL = 4\n')]

Thanks!

Juan Diego Godoy Robles

1 Answer

You can use a positive lookahead to make sure you lazily match the value correctly:

(\w+)\s=\s(.+?)(?=$|\n[A-Z])
                ^^^^^^^^^^^^

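Use it with the DOTALL modifier so that . can also match a newline. The (?=$|\n[A-Z]) lookahead makes .+? keep matching until it reaches the end of the string or a newline followed by an uppercase letter. (Without the MULTILINE flag, $ also matches just before a single trailing newline at the very end of the string.)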
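To see what the lookahead changes, here is a minimal sketch that runs the same lazy pattern with and without it. The shortened `raw` sample and the variable names `no_anchor` and `with_lookahead` are illustrative only and not part of the question.

```python
import re

# Shortened sample, built line by line so the whitespace is unambiguous.
raw = (
    "\n"
    "CONTENT = ALL\n"
    "TABLES = TEST.RAW_1\n"
    "        , TEST.RAW_2\n"
    "PARALLEL = 4\n"
)

# Lazy value with nothing after it: .+? is satisfied by a single character.
no_anchor = re.findall(r'(\w+)\s=\s(.+?)', raw, re.DOTALL)

# Same lazy value, but the lookahead forces it to grow until the
# next "newline + uppercase letter" or the end of the string.
with_lookahead = re.findall(r'(\w+)\s=\s(.+?)(?=$|\n[A-Z])', raw, re.DOTALL)

print(no_anchor)
# [('CONTENT', 'A'), ('TABLES', 'T'), ('PARALLEL', '4')]
print(with_lookahead)
# [('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n        , TEST.RAW_2'), ('PARALLEL', '4')]
```

Note that this relies on tags starting with an uppercase ASCII letter and continuation lines not doing so, as in the sample input.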

See the regex demo.

An alternative, faster regex (it is an unrolled version of the expression above) is shown below. Note that the DOTALL modifier must NOT be used with it:

(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)

See another regex demo

Explanation:

  • (\w+) - Group 1 capturing 1+ word chars
  • \s*=\s* - a = symbol wrapped with optional (0+) whitespaces
  • (.*(?:\n(?![A-Z]).*)*) - Group 2 capturing:
    • .* - any 0+ characters other than a newline
    • (?:\n(?![A-Z]).*)* - 0+ sequences of:
      • \n(?![A-Z]) - a newline symbol not followed with an uppercase ASCII letter
      • .* - any 0+ characters other than a newline
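The unrolled pattern can be tried the same way. This is a minimal sketch, assuming a shortened sample input; the names `unrolled`, `pairs`, and `cleaned` are illustrative. One detail worth noticing: because \n(?![A-Z]) also succeeds on a newline at the very end of the string, the last value keeps its trailing newline, so the sketch strips it afterwards.

```python
import re

# Shortened sample, built line by line so the whitespace is unambiguous.
raw = (
    "\n"
    "CONTENT = ALL\n"
    "TABLES = TEST.RAW_1\n"
    "        , TEST.RAW_2\n"
    "PARALLEL = 4\n"
)

# No re.DOTALL here: each .* must stop at the end of its line so that
# the \n(?![A-Z]) check decides whether the next line is a continuation.
unrolled = re.compile(r'(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)')

pairs = unrolled.findall(raw)

# Optional cleanup: drop the trailing newline captured for the last tag.
cleaned = [(tag, value.rstrip('\n')) for tag, value in pairs]

print(pairs)
# [('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n        , TEST.RAW_2'), ('PARALLEL', '4\n')]
print(cleaned)
# [('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n        , TEST.RAW_2'), ('PARALLEL', '4')]
```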

Python demo:

import re
p = re.compile(r'(\w+)\s=\s(.+?)(?=$|\n[A-Z])', re.DOTALL)
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
        , TEST.RAW_2
        , TEST.RAW_3
        , TEST.RAW_4
PARALLEL = 4
'''
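print(p.findall(raw))
# [('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n        , TEST.RAW_2\n        , TEST.RAW_3\n        , TEST.RAW_4'), ('PARALLEL', '4')]
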
Wiktor Stribiżew