How would one write a regular expression to use in Python to split paragraphs?

A paragraph break is defined by two line breaks (\n). Any number of spaces or tabs may appear together with the line breaks, and it should still be treated as a paragraph break.

I am using Python, so the solution can use Python's extended regular expression syntax (for example, named groups such as (?P...)).

Examples:

the_str = 'paragraph1\n\nparagraph2'
# Splitting should yield ['paragraph1', 'paragraph2']

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
# Should yield ['p1', 'p2\t\n\tstill p2', 'p3']

the_str = 'p1\n\n\n\tp2'
# Should yield ['p1', '\n\tp2']

The best I could come up with is: r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.

import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)

But that is ugly. Is there anything better?

Suggestions rejected:

r'\s*?\n\s*?\n\s*?' -> That would make example 2 and 3 fail, since \s includes \n, so it would allow paragraph breaks with more than 2 \ns.
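A quick check against the examples above may help show what goes wrong. This is only a sketch: the variable names are illustrative, and the greedy variant is an assumption added for comparison, not something quoted from the suggestion.

```python
import re

example2 = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
example3 = 'p1\n\n\n\tp2'

# Greedy variant (an assumption, not quoted from the suggestion):
# \s* also consumes the third newline and the tab in example 3.
greedy = re.split(r'\s*\n\s*\n\s*', example3)
print(greedy)  # ['p1', 'p2'], wanted ['p1', '\n\tp2']

# Lazy variant as suggested: the trailing \s*? matches nothing,
# so the tab before p3 is left in place in example 2.
lazy = re.split(r'\s*?\n\s*?\n\s*?', example2)
print(lazy)  # ['p1', 'p2\t\n\tstill p2', '\tp3'], wanted 'p3' last
```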

Peter Mortensen
nosklo

4 Answers

Unfortunately there's no nice way to write "space but not a newline".

I think the best you can do is add some space with the x modifier and try to factor out the ugliness a bit, but that's questionable: (?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?

You could also try creating a subrule just for the character class and interpolating it three times.
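One way the subrule idea might look in Python is sketched below. The names BLANK, PARAGRAPH_BREAK and split_paragraphs are hypothetical, and the sketch uses greedy quantifiers (unlike the lazy ones in the pattern above) so that whitespace after the second newline is also consumed.

```python
import re

# Hypothetical subrule: every character \s matches in ASCII except "\n".
BLANK = r'[ \t\r\f\v]*'

# Verbose mode ignores whitespace outside character classes, so the
# subrule can be interpolated three times with room to breathe.
PARAGRAPH_BREAK = re.compile(rf'(?: {BLANK} \n ){{2}} {BLANK}', re.VERBOSE)

def split_paragraphs(text):
    """Split text on exactly two newlines plus surrounding blanks."""
    return PARAGRAPH_BREAK.split(text)

print(split_paragraphs('paragraph1\n\nparagraph2'))
print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
print(split_paragraphs('p1\n\n\n\tp2'))
```

With the three examples from the question this should give the requested lists, since the character class never matches a newline and so cannot swallow a third one.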

Eevee

It is not a regexp, but it is really elegant:

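from itertools import groupby

def paragraph(lines):
    # Split into lines (keeping line endings) and group consecutive lines
    # by whether they consist only of whitespace.
    for group_separator, line_iteration in groupby(lines.splitlines(True), key=str.isspace):
        # Whitespace-only groups are separators; everything else is a paragraph.
        if not group_separator:
            yield ''.join(line_iteration)

for p in paragraph('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'):
    print(repr(p))

'p1\n'
'p2\t\n\tstill p2\t   \n'
'\tp3'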

It's up to you to strip the output as you need it of course.
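As a rough sketch of that stripping step, the generator could strip each paragraph before yielding it. The function name stripped_paragraphs is hypothetical, not part of the original answer.

```python
from itertools import groupby

def stripped_paragraphs(text):
    """Yield paragraphs with leading and trailing whitespace removed."""
    lines = text.splitlines(True)  # keep line endings so joins are lossless
    for is_blank, group in groupby(lines, key=str.isspace):
        if not is_blank:
            yield ''.join(group).strip()

result = list(stripped_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
print(result)  # ['p1', 'p2\t\n\tstill p2', 'p3']
```

Note that this approach collapses any run of blank lines into a single separator, so the third example from the question would come out as ['p1', 'p2'] rather than ['p1', '\n\tp2'].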

It was inspired by the famous "Python Cookbook" ;-)

Bite code
  • Neat solution. What's `str_isspace`? – Brian M. Hunt Nov 01 '11 at 18:12
  • A typo :-) It should read str.isspace, which is the isspace() method of the string type. It is called on each line to determine whether it is whitespace, and the lines are grouped according to the result. I fixed it. – Bite code Nov 01 '11 at 20:19

You may be trying to deduce the structure of a document in plain text, which is what Docutils does.

You might be able to simply use the Docutils parser rather than roll your own.

S.Lott

Almost the same, but using non-greedy quantifiers and taking advantage of the whitespace sequence.

\s*?\n\s*?\n\s*?
Joseph Bui