I'm trying to parse a string with multiple lines.

Suppose it is:

text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''

I want to use the finditer method of the re module to get one dictionary per section, like:

{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\n'}
{'section': 'Section2', 'section_data': 'stuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2\n'}

I tried the following:

import re
re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+)", re.DOTALL)
sections_it = re_sections.finditer(text)

for m in sections_it:
    print(m.groupdict())

But this results in:

{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\nSection2\nstuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2\n'}

So the section_data also matches Section2.

I also tried to tell the second group to match everything except the section header, but this produces no output at all:

re_sections=re.compile(r"(?P<section>Section\d)\s+(?P<section_data>^(?P=section))", re.DOTALL)

I know I could use the following regex, but I'm looking for a version where I don't have to spell out what the second group looks like:

re_sections=re.compile(r"(?P<section>Section\d)\s+(?P<section_data>[a-z12\s]+)", re.DOTALL)

Thank you very much!

user2221323

1 Answer

Use a look-ahead to match everything up to the next section header, or the end of the string:

re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?:Section\d|$))", re.DOTALL)

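Note that this needs a non-greedy .+? as well. A greedy .+ would first run all the way to the end of the string, where the $ alternative of the look-ahead already succeeds, so the first section would still swallow everything.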

Demo:

>>> re_sections=re.compile(r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?:Section\d|$))", re.DOTALL)
>>> for m in re_sections.finditer(text): print(m.groupdict())
... 
{'section': 'Section1', 'section_data': 'stuff belonging to section1\nstuff belonging to section1\nstuff belonging to section1\n'}
{'section': 'Section2', 'section_data': 'stuff belonging to section2\nstuff belonging to section2\nstuff belonging to section2'}
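
For reference, here is a self-contained sketch of the same demo as a script, with print written as a function so it runs on Python 3 as well. The names `results`, `re_sections_z` and `results_z` are mine, not from the question. The second pattern is an optional variant, assuming you want the final newline kept in the last section: without re.MULTILINE, `$` also matches just before a trailing newline, while `\Z` matches only at the very end of the string.

```python
import re

text = '''
Section1
stuff belonging to section1
stuff belonging to section1
stuff belonging to section1
Section2
stuff belonging to section2
stuff belonging to section2
stuff belonging to section2
'''

# The lazy .+? grows one character at a time and stops as soon as the
# look-ahead sees the next header or the end of the string.
re_sections = re.compile(
    r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=Section\d|$)",
    re.DOTALL,
)
results = [m.groupdict() for m in re_sections.finditer(text)]
for r in results:
    print(r)

# Variant (my assumption about what may be wanted): \Z matches only at the
# very end, so the last section keeps its trailing newline.
re_sections_z = re.compile(
    r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=Section\d|\Z)",
    re.DOTALL,
)
results_z = [m.groupdict() for m in re_sections_z.finditer(text)]
```

With `$`, the last section's data ends in `section2`; with `\Z`, it ends in `section2\n`, which matches the shape asked for in the question.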
Martijn Pieters
  • Already tried, leads to: {'section': 'Section1', 'section_data': 's'} {'section': 'Section2', 'section_data': 's'} – user2221323 Apr 11 '13 at 15:56
  • @user2221323: Yeah, noticed that too; a look-ahead is needed, updated the answer. – Martijn Pieters Apr 11 '13 at 15:56
  • Great! This is working! Is it possible to avoid repeating Section\d in the last part of the re, (?=(?:Section\d|$)), and use a reference like (?=(?:(?P=section)|$)) instead? That attempt gives the same output as in the question :/ I looked up the positive lookahead assertion. As far as I understand, it succeeds if the re matches at the current location, and the whole re is then tried again at the current location? But I don't understand why the |$ is needed. – user2221323 Apr 11 '13 at 16:25
  • No, you can't reuse the `section` match, because it'll only match again if it has the *same section number*, so the exact same literal text. – Martijn Pieters Apr 11 '13 at 16:28
  • @user2221323: The look-ahead acts as an anchor, text before it matches if the position for the look-ahead matches the `Section\d` part next. The `|$` part is needed to match the *last* entry in your text; either there is a *next* section or we are at the end of the string. – Martijn Pieters Apr 11 '13 at 16:29
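
To illustrate the two comments above, here is a small sketch of what the backreference version does. The `sample` string is a made-up input (not the text from the question) in which Section1 appears twice, and the names `backref` and `backref_hits` are mine. Because `(?P=section)` only matches the exact text captured earlier, the look-ahead skips straight past Section2 and stops only at the next literal Section1 or at the end.

```python
import re

# Hypothetical input: Section1 occurs twice so the backreference has
# something to match; Section2 sits in between.
sample = "Section1\na\nSection2\nb\nSection1\nc\n"

# (?P=section) means "the same text the section group captured",
# not "anything that the section group's pattern could match".
backref = re.compile(
    r"(?P<section>Section\d)\s*(?P<section_data>.+?)(?=(?P=section)|$)",
    re.DOTALL,
)
backref_hits = [m.groupdict() for m in backref.finditer(sample)]
for hit in backref_hits:
    print(hit)
```

The first match's section_data contains the Section2 header and its content, which is the same symptom as in the question: with only one Section1 in the text, the backreference never matches and only `$` can end the group.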