
I'm trying to extract multiple tuples from the following 'text' using Python's re.findall():

text = '[szur formatter] line 1<?xml version="1.0"?><star>[szur parser] line 2<?xml version="1.0"?><Planet>'

I want to get the following matches from 'text':

    Match 1
    [szur formatter] line 1 
    <?xml version="1.0"?><star>

    Match 2
    [szur parser] line 2
    <?xml version="1.0"?><Planet> 

I'm trying to do this with findall using this regex:

re.findall(r'\[(szur.*?[^<])(<.*>+)', text)

This yields:

[('szur formatter] line 1', '<?xml version="1.0"?><star>[szur parser] line 2<?xml version="1.0"?><Planet>')]

How can I get the expected results? My regex yields a single tuple whose second element swallows the rest of the string, so the second tuple never appears. How should I amend the regex? Any pointers will be appreciated.

Pavel
vibz

2 Answers


Here's a regexp that makes some assumptions (each header looks like "[szur ...] line N", and the XML that follows it contains no "["):

>>> re.findall(r"(\[szur.*?[^\]]\] line \d*)([^\[]*)", text)
[('[szur formatter] line 1', '<?xml version="1.0"?><star>'), 
 ('[szur parser] line 2',    '<?xml version="1.0"?><Planet>')]
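
To see why the original attempt returns only one tuple, it may help to run both patterns side by side. The following is a minimal sketch that reuses the 'text' from the question; the variable names `greedy` and `pairs` are illustrative only. The greedy `.*` in the question's second group runs on to the last `>` in the string, whereas `[^\[]*` stops at the next `[`.

```python
import re

text = ('[szur formatter] line 1<?xml version="1.0"?><star>'
        '[szur parser] line 2<?xml version="1.0"?><Planet>')

# Question's pattern: the greedy `.*` in group 2 consumes everything
# up to the final `>`, so only one match is found.
greedy = re.findall(r'\[(szur.*?[^<])(<.*>+)', text)

# Answer's pattern: group 2 is `[^\[]*`, which cannot cross the `[`
# that starts the next header, so each header gets its own tuple.
pairs = re.findall(r"(\[szur.*?[^\]]\] line \d*)([^\[]*)", text)

print(len(greedy))  # 1
print(len(pairs))   # 2
for header, xml in pairs:
    print(header, '|', xml)
```

The second pattern only works as long as the XML part never contains a literal `[` (for example inside a CDATA section), which is one of the assumptions mentioned above.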

But seriously, man, if you find yourself parsing a mix of XML and non-XML with regexp, ask yourself: "how did I get here?"

Pavel

I wonder if this is a good idea (using regular expressions, that is) but here you go:

\[szur[^][]*\].*?<\w+>

Use the DOTALL modifier (so that . also matches newlines, in case the input spans several lines) and see a demo on regex101.com.


In Python:
import re

string = """[szur formatter] line 1<?xml version="1.0"?><star>[szur parser] line 2<?xml version="1.0"?><Planet>"""

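# Flags belong in re.compile(); the second positional argument of a
# compiled pattern's findall() is a start position, not a flags value.
rx = re.compile(r'(\[szur[^][]*\].*?<\w+>)', re.DOTALL)

matches = rx.findall(string)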
print(matches)
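
The pattern above returns each match as one string. If tuples are wanted, as in the question, one possible variation is to split it into two capture groups. This is only a sketch, and it assumes that every header is followed by an XML declaration and then a single opening tag, as in the sample input; the names `rx_pairs` and `pairs` are illustrative.

```python
import re

string = ('[szur formatter] line 1<?xml version="1.0"?><star>'
          '[szur parser] line 2<?xml version="1.0"?><Planet>')

# Group 1: the bracketed header plus everything up to the first `<`.
# Group 2: from that `<` lazily through the first `<word>` opening tag.
rx_pairs = re.compile(r'(\[szur[^][]*\][^<]*)(<.*?<\w+>)', re.DOTALL)

pairs = rx_pairs.findall(string)
for header, xml in pairs:
    print(header, '|', xml)
```

With two groups in the pattern, findall() returns a list of 2-tuples instead of a list of strings.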
Jan