Python - how to separate paragraphs from text?

Question

I need to split a text into paragraphs and be able to work with each of them. How can I do that? Paragraphs are separated by at least one empty line, like this:

Hello world,
  this is an example.

Let´s program something.


Creating  new  program.

Thanks in advance.

Assuming the text is in a text file: read the file line by line, and whenever you encounter a blank line, you know that whatever came before it belongs to one paragraph. Continue the same way for the rest of the text. – jar, Nov 10 '18 at 16:10
That is clear to me, but I need help with the syntax for writing this. – kom20, Nov 10 '18 at 16:14
@kom20 do you know how to open a file and read a line? What difficulty do you have specifically? – Jon Clements, Nov 10 '18 at 16:17
I know that, but I need to align all paragraphs to a set width in characters, and for that I need to separate the paragraphs from the text and work with each one individually. – kom20, Nov 10 '18 at 16:21

score 8 · Answer 1 · answered Nov 10 '18 at 17:00

This should work:

text.split('\n\n')

answered Nov 10 '18 at 17:00 by roeen30

Thanks, it seems good. But since the text ends with some empty lines, the last items in this list are empty, like this: ["something","",""]. Can this cause any problems once I start working with the individual words in these paragraphs? – kom20 Nov 10 '18 at 18:00
This is for you to say. You can always filter them out with `filter(None, ...)` – roeen30 Nov 10 '18 at 18:26
The question specifies "at least one empty line", so this solution is not entirely correct on its own. For instance, "one\n\n\n\n\ntwo\n".split("\n\n") is not super pretty. – traal Nov 16 '20 at 18:41
Nope, it is not sufficient. \n\n\n is a valid separator, so should be \n\t\20\n\t\t\n – ChewbaccaKL Jul 03 '22 at 21:17
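
To make the objections above concrete, here is a small sketch of what a plain split on "\n\n" returns when paragraphs are separated by more than one empty line, and what `filter(None, ...)` does and does not clean up. The sample string is the one from traal's comment; the variable names are only illustrative.

```python
# Sample from the comment above: five newlines between "one" and "two".
sample = "one\n\n\n\n\ntwo\n"

pieces = sample.split("\n\n")
print(pieces)  # ['one', '', '\ntwo\n']

# filter(None, ...) drops the empty strings, but the stray newlines
# attached to the last piece are still there.
non_empty = list(filter(None, pieces))
print(non_empty)  # ['one', '\ntwo\n']
```

So filtering alone removes the empty items, while leftover newlines still need a `strip("\n")` or a different splitting strategy.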

bolzano · Answer 2 · 2018-11-10T21:14:26.307

5

Try

result = list(filter(lambda x : x != '', text.split('\n\n')))

edited Nov 10 '18 at 21:14

answered Nov 10 '18 at 21:05 by bolzano

While this might answer the author's question, it lacks some explaining words and/or links to documentation. Raw code snippets are not very helpful without some phrases around them. You may also find [how to write a good answer](https://stackoverflow.com/help/how-to-answer) very helpful. Please edit your answer. – hellow Nov 11 '18 at 07:21
I think it would be more pythonic to write this as a list comprehension, instead of using filter + lambda. Also, split("\n\n") does not work cleanly if paragraphs are separated by more than a single empty line. – traal Nov 16 '20 at 18:44
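
A minimal sketch of the list-comprehension form suggested in the comment above. It is a variation, not the answer's own code: it additionally strips stray newlines from each piece, and the name `split_on_blank_lines` is made up for illustration. It still assumes that separator lines are truly empty; a line containing only spaces or tabs would not be treated as a paragraph break.

```python
def split_on_blank_lines(text):
    # Split on a double newline, skip pieces that are empty or only whitespace,
    # and remove newlines left over from separators longer than two newlines.
    return [p.strip("\n") for p in text.split("\n\n") if p.strip()]


# The sample text from the question, written out with explicit newlines.
text = (
    "Hello world,\n"
    "  this is an example.\n"
    "\n"
    "Let´s program something.\n"
    "\n"
    "\n"
    "Creating  new  program.\n"
    "\n"
)

result = split_on_blank_lines(text)
for paragraph in result:
    print(repr(paragraph))
```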

traal · Answer 3 · 2020-11-16T19:19:40.163

Not an entirely trivial problem, and the standard library doesn't seem to have any ready solutions.

Paragraphs in your example are separated by two or more newlines, which unfortunately means that a plain text.split("\n\n") is not enough. I think that splitting by a regular expression is a workable strategy instead:

import fileinput
import re

NEWLINES_RE = re.compile(r"\n{2,}")  # two or more "\n" characters

def split_paragraphs(input_text=""):
    no_newlines = input_text.strip("\n")  # remove leading and trailing "\n"
    split_text = NEWLINES_RE.split(no_newlines)  # regex splitting

    paragraphs = [p + "\n" for p in split_text if p.strip()]
    # p + "\n" ensures that every paragraph ends with a newline
    # p.strip() is truthy only if the paragraph contains non-whitespace characters

    return paragraphs

# sample code, to split all script input files into paragraphs
text = "".join(fileinput.input())
for paragraph in split_paragraphs(text):
    print(f"<<{paragraph}>>\n")
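
The regular expression above matches only runs of bare "\n" characters. If separator lines may also contain spaces or tabs (an assumption; the question does not say so), a slightly looser pattern can be used. The sketch below is a variation on the answer, not part of it, and `BLANK_LINES_RE` and `split_paragraphs_loose` are made-up names.

```python
import re

# A newline, then optional whitespace (which may include more newlines),
# then another newline: one or more blank or whitespace-only lines.
BLANK_LINES_RE = re.compile(r"\n\s*\n")


def split_paragraphs_loose(input_text):
    pieces = BLANK_LINES_RE.split(input_text)
    # Drop empty pieces and trim newlines at the edges of each paragraph;
    # indentation inside a paragraph is left untouched.
    return [p.strip("\n") for p in pieces if p.strip()]


# The question's sample, with spaces and a tab on the first separator line.
sample = (
    "Hello world,\n  this is an example.\n \t\n"
    "Let´s program something.\n\n\n"
    "Creating  new  program.\n\n"
)
paragraphs = split_paragraphs_loose(sample)
for p in paragraphs:
    print(repr(p))
```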

Edited to add:

It is probably cleaner to use a state machine approach. Here's a fairly simple example using a generator function, which has the added benefit of streaming through the input one line at a time, and not storing complete copies of the input in memory:

import fileinput

def split_paragraph2(input_lines):
    paragraph = []  # store current paragraph as a list
    for line in input_lines:
        if line.strip():  # True if line is non-empty (apart from whitespace)
            paragraph.append(line)
        elif paragraph:  # If we see an empty line, return paragraph (if any)
            yield "".join(paragraph)
            paragraph = []
    if paragraph:  # After end of input, return final paragraph (if any)
        yield "".join(paragraph)

# sample code, to split all script input files into paragraphs
for paragraph in split_paragraph2(fileinput.input()):
    print(f"<<{paragraph}>>\n")
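
The generator accepts any iterable of lines, not only `fileinput.input()`. As a hedged usage sketch, the snippet below repeats the function from the answer so that it runs on its own, and feeds it an in-memory string through `str.splitlines(keepends=True)`. The sample text is the one from the question.

```python
def split_paragraph2(input_lines):
    # Same generator as in the answer above, repeated here for a standalone run.
    paragraph = []
    for line in input_lines:
        if line.strip():
            paragraph.append(line)
        elif paragraph:
            yield "".join(paragraph)
            paragraph = []
    if paragraph:
        yield "".join(paragraph)


text = (
    "Hello world,\n  this is an example.\n\n"
    "Let´s program something.\n\n\n"
    "Creating  new  program.\n\n"
)

# keepends=True keeps each line's "\n", so joined paragraphs keep their line breaks.
paragraphs = list(split_paragraph2(text.splitlines(keepends=True)))
for number, paragraph in enumerate(paragraphs, start=1):
    print(f"{number}: {paragraph!r}")
```

An open file object or an `io.StringIO` wrapper would work the same way, since both yield lines that include their trailing newline.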

Prayson W. Daniel · Answer 4 · 2020-11-16T19:26:49.717

1

I usually split then filter out the '' and strip. ;)

a =\
'''
Hello world,
  this is an example.

Let´s program something.


Creating  new  program.


'''

data = [content.strip() for content in a.splitlines() if content]

print(data)

edited Nov 16 '20 at 19:26

answered Nov 10 '18 at 21:25 by Prayson W. Daniel

This does not actually split into paragraphs. It splits into non-empty lines. The first two lines in this example should be in the same paragraph! – traal Nov 16 '20 at 18:34
I did not know that is a requirement;) – Prayson W. Daniel Nov 16 '20 at 19:27

score 0 · Answer 5 · edited Apr 30 '20 at 07:47

This worked for me:

text = "".join(text.splitlines())  # join all lines into one string
delimiter = "."  # something that is almost always used to separate sentences (a period, question mark, etc.)
parts = text.split(delimiter)

edited Apr 30 '20 at 07:47 by Jaimil Patel

answered Apr 30 '20 at 04:58 by letme sleepplz

score -1 · Answer 6 · edited May 02 '21 at 05:21

Easier. I had the same problem.

Just replace each double newline (\n\n) with a character that you seldom see in the text (here ¾), then split on that character:

a ='''
Hello world,
  this is an example.

Let´s program something.


Creating  new  program.'''
a = a.replace("\n\n" , "¾")

splitted_text = a.split('¾')

print(splitted_text)
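
As a quick check (a sketch, assuming the placeholder character never occurs in the text), replacing "\n\n" with a placeholder and then splitting on it gives the same list as splitting on "\n\n" directly. That includes the leftover newline when three newlines separate two paragraphs. The variable names below are illustrative.

```python
a = (
    "\nHello world,\n  this is an example.\n\n"
    "Let´s program something.\n\n\n"
    "Creating  new  program."
)

via_placeholder = a.replace("\n\n", "¾").split("¾")
direct = a.split("\n\n")

print(via_placeholder == direct)  # True
print(repr(direct[-1]))  # '\nCreating  new  program.'
```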

edited May 02 '21 at 05:21 by AlphaModder

answered May 02 '21 at 04:48 by Bolieu_85
