
I want to fetch a very big text file and split it into paragraphs. The text file can have more than one line break separating paragraphs. I am using El Quijote. I would like to do it with the nltk library, because I will probably be using it later.

With a little regex, in Python, I can do this and it works. Basically I split the text into paragraphs the following way; I wonder if there's a simpler, more efficient way:

cleared_txt = re.sub('\n{2,}', '\n\n', txt)
paragraphs = cleared_txt.split('\n\n')
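For reference, the same split can also be done in a single call with `re.split` (still regex rather than nltk); a minimal sketch, assuming `txt` already holds the file contents loaded in the code below:

import re

paragraphs = re.split(r'\n{2,}', txt)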

The following code writes a file with a single blank line between paragraphs, fixing the original file's structure.

Is it possible to do this with nltk?

import os
import re

book_file = os.path.join(os.getcwd(), 'data/el_quijote.txt')
book_fixed = os.path.join(os.getcwd(), 'data/el_quijote_fixed.txt')
with open(book_file, encoding='utf-8') as f:
    txt = f.read()
# Collapse runs of blank lines so paragraphs are separated by exactly one blank line
cleared_txt = re.sub(r'\n{2,}', '\n\n', txt)
with open(book_fixed, "w", encoding='utf-8') as f2:
    f2.write(cleared_txt)
paragraphs = cleared_txt.split('\n\n')
count_blank = 0
count_text = 0
for p in paragraphs:
    p = p.strip()
    if p == "":
        count_blank += 1
    elif len(p) > 0:
        count_text += 1
print("Total paragraphs: {0},\ntotal blank lines: {1},\ntotal non empty blocks: {2}".format(len(paragraphs), count_blank, count_text))

The code works as expected and displays this output:

Total paragraphs: 5255,
total blank lines: 0,
total non empty blocks: 5255

This is similar to the following questions, but none of them has a concrete answer.

How to split Text into paragraphs using NLTK nltk.tokenize.texttiling?

Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?

santi
  • You may use `re.split` directly - `paragraphs = re.split(r'\n{2,}', txt)`. Or `r'(?:\r?\n){2,}'` – Wiktor Stribiżew Feb 27 '19 at 13:53
  • Also, have you seen [this thread](https://stackoverflow.com/q/39971017/3832970)? – Wiktor Stribiżew Feb 27 '19 at 14:00
  • @WiktorStribiżew, I can do that; I did it the other way to write an output file. Also, the method in that thread is giving an assertion error, I am checking it out, thanks. – santi Feb 27 '19 at 14:50
  • Well, you may read a file line by line and append a paragraph to a list once you encounter an empty line. It might be quicker than a regex on a big text. – Wiktor Stribiżew Feb 27 '19 at 14:52
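A minimal sketch of that line-by-line idea from the last comment (using the same `book_file` as in the question; whether it actually beats the regex on a big file would need measuring):

paragraphs = []
current = []
with open(book_file, encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            current.append(line)
        elif current:  # a blank line closes the current paragraph
            paragraphs.append(' '.join(current))
            current = []
if current:  # flush the last paragraph if the file does not end with a blank line
    paragraphs.append(' '.join(current))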

1 Answer


I wanted to put this as a comment, but it was too long for a comment.

One way to break long texts into paragraphs is to split them into lines or sentences and then join those back into sets of paragraphs. The problem with this approach is that we lose the original paragraph cohesion; however, it should work for some cases. Use the nltk tokenizer to split the sentences.

import nltk

nltk_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
sentences = [
    sentence
    if len(sentence.split(" ")) < 60  # model token size, update accordingly
    # summarise, summary_tokenizer and summary_model come from my own
    # summarisation setup and are not defined in this snippet
    else summarise(summary_tokenizer, summary_model, sentence)
    for sentence in nltk_tokenizer.tokenize(texts)
]
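For the original question specifically, the punkt models work at the sentence level and have no notion of paragraphs, so one hedged way to combine the two is to keep the blank-line split for paragraphs and use nltk only inside each paragraph. This sketch reuses `txt` from the question and assumes the Spanish punkt data is available (El Quijote is Spanish):

import re
import nltk

nltk.download('punkt')  # one-time download of the punkt sentence models
paragraphs = [p for p in re.split(r'\n{2,}', txt) if p.strip()]
sentences_per_paragraph = [
    nltk.sent_tokenize(p, language='spanish') for p in paragraphs
]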
Arshad Ansari