How to obtain multiple tuples using Python's findall

Question

I'm trying to obtain multiple tuples from the following 'text' using Python's re.findall():

text = '[szur formatter] line 1<?xml version="1.0"?><star>[szur parser] line 2<?xml version="1.0"?><Planet>'

I want to get the following matching patterns from 'text'

    Match 1
    [szur formatter] line 1 
    <?xml version="1.0"?><star>

    Match 2
    [szur parser] line 2
    <?xml version="1.0"?><Planet>

I'm trying to do this with findall using this regex:

re.findall(r'\[(szur.*?[^<])(<.*>+)', text)

This yields:

[('szur formatter] line 1', '<?xml version="1.0"?><star>[szur parser] line 2<?xml version="1.0"?><Planet>')]

How can I get the expected results? My regex doesn't yield the second tuple. How do I need to amend it to obtain both? Any pointers will be appreciated.

`(\[szur.*?[^\[<]+)([^\[]+)` ? https://regex101.com/r/EaPBwA/2 — splash58, Dec 26 '17 at 15:55

score 0 · Answer 1 · answered Dec 26 '17 at 16:00

Here's a regexp that makes some assumptions:

>>> re.findall(r"(\[szur.*?[^\]]\] line \d*)([^\[]*)", text)
[('[szur formatter] line 1', '<?xml version="1.0"?><star>'), 
 ('[szur parser] line 2',    '<?xml version="1.0"?><Planet>')]

But seriously, man, if you find yourself parsing a mix of XML and non-XML with regexp, ask yourself: "how did I get here?"
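For readers wondering why the pattern in the question returns only one tuple: the `.*` in its second group is greedy, so it runs to the last `>` in the string and swallows the second record. Below is a minimal sketch of one possible fix. It assumes that a header never contains `<` and that the XML part never contains `[`; the names `original` and `amended` are only illustrative.

```python
import re

text = ('[szur formatter] line 1<?xml version="1.0"?><star>'
        '[szur parser] line 2<?xml version="1.0"?><Planet>')

# Pattern from the question: the greedy ".*" consumes everything
# up to the last ">" in the string, so only one tuple comes back.
original = re.findall(r'\[(szur.*?[^<])(<.*>+)', text)

# Assumption: headers contain no "<" and the XML part contains no "[",
# so negated character classes can mark where each piece ends.
amended = re.findall(r'(\[szur[^<]*)(<[^\[]*)', text)

for header, xml in amended:
    print(header, '|', xml)
```

If the XML payload could itself contain `[` (for example in a CDATA section), this assumption breaks and a real parser for the XML part would be the safer route.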

score 0 · Answer 2 · answered Dec 26 '17 at 16:45

I wonder if this is a good idea (using regular expressions, that is) but here you go:

\[szur[^][]*\].*?<\w+>

Use the DOTALL modifier and see a demo on regex101.com.

In Python:

import re

string = """[szur formatter] line 1<?xml version="1.0"?><star>[szur parser] line 2<?xml version="1.0"?><Planet>"""

rx = re.compile(r'(\[szur[^][]*\].*?<\w+>)')

matches = rx.findall(string)
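# For input that spans several lines, pass the flag to re.compile();
# the second argument of rx.findall() is a start position, not flags:
# rx = re.compile(r'(\[szur[^][]*\].*?<\w+>)', re.DOTALL)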
print(matches)
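A note on the DOTALL flag, as a hedged sketch: with a compiled pattern, the second positional argument of `findall` is a start position (`pos`), not a flags value, so the flag has to be given to `re.compile`. The multi-line sample below is invented for illustration; the text in the question has no newlines, so DOTALL makes no difference there.

```python
import re

pattern = r'\[szur[^][]*\].*?<\w+>'

# Hypothetical input with newlines inside each record (not from the question).
multiline = ('[szur formatter] line 1\n<?xml version="1.0"?>\n<star>'
             '[szur parser] line 2\n<?xml version="1.0"?>\n<Planet>')

# Without DOTALL, "." stops at a newline, so no record matches.
without_flag = re.compile(pattern).findall(multiline)

# With DOTALL passed to re.compile, "." also matches newlines.
with_flag = re.compile(pattern, re.DOTALL).findall(multiline)

# Passing the flag to findall() is read as pos=16, which silently
# skips the first 16 characters of the string.
single_line = multiline.replace('\n', '')
skipped = re.compile(pattern).findall(single_line, re.DOTALL)

print(len(without_flag), len(with_flag), len(skipped))
```

The last call is the easy mistake to make: it raises no error, it just starts searching partway into the string.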
