I want to read a very large text file and split it into paragraphs. Paragraphs in the file may be separated by more than one line break. I am using El Quijote. I would like to do it with the nltk library, because I will probably be using it later.
With a little regex in Python I can do this, and it works. Basically, I split the text into paragraphs as follows, and I wonder whether there is a simpler, more efficient way:
cleared_txt = re.sub('\n{2,}', '\n\n', txt)
paragraphs = cleared_txt.split('\n\n')
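One possibly simpler variant is to skip the normalisation step and split directly on any run of two or more line breaks. This is only a minimal sketch: `split_paragraphs` and the sample string are names made up for illustration, and it assumes the file uses `\n` line endings with no whitespace-only lines between paragraphs.

```python
import re


def split_paragraphs(text):
    """Split text on any run of two or more newlines (hypothetical helper)."""
    # strip() avoids empty first/last entries caused by leading/trailing blank lines
    return re.split(r'\n{2,}', text.strip())


sample = "First paragraph.\n\n\n\nSecond paragraph,\nstill second.\n\nThird."
paragraphs = split_paragraphs(sample)
print(len(paragraphs))  # 3
```

Single line breaks inside a paragraph are preserved, so the second paragraph above keeps its internal `\n`.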
Is it possible to do this with nltk?
The following code writes a fixed copy of the original file with exactly one blank line between paragraphs, then counts the resulting blocks:
import os
import re

book_file = os.path.join(os.getcwd(), 'data/el_quijote.txt')
book_fixed = os.path.join(os.getcwd(), 'data/el_quijote_fixed.txt')

# Read the whole book into memory
with open(book_file, encoding='utf-8') as f:
    txt = f.read()

# Collapse any run of two or more line breaks into exactly one blank line
cleared_txt = re.sub(r'\n{2,}', '\n\n', txt)

# Save the normalised text
with open(book_fixed, "w", encoding='utf-8') as f2:
    f2.write(cleared_txt)

paragraphs = cleared_txt.split('\n\n')
count_blank = 0
count_text = 0
for p in paragraphs:
    p = p.strip()
    if p == "":
        count_blank += 1
    else:
        count_text += 1
print("Total paragraphs: {0},\ntotal blank lines: {1},\ntotal non empty blocks: {2}".format(len(paragraphs), count_blank, count_text))
The code works as expected and displays this output:
Total paragraphs: 5255,
total blank lines: 0,
total non empty blocks: 5255
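On the nltk side, one candidate seems to be `BlanklineTokenizer` from `nltk.tokenize`, which treats blank-line gaps as separators and discards empty tokens. The sketch below assumes nltk is installed; the wrapper `split_paragraphs_nltk`, the sample text, and the regex fallback for when nltk is missing are my own additions for illustration, not part of nltk.

```python
import re

try:
    # Assumes nltk is installed; this tokenizer needs no extra data downloads
    from nltk.tokenize import BlanklineTokenizer
except ImportError:
    BlanklineTokenizer = None


def split_paragraphs_nltk(text):
    """Split text into paragraphs at blank lines (hypothetical wrapper)."""
    if BlanklineTokenizer is not None:
        return BlanklineTokenizer().tokenize(text)
    # Fallback without nltk: split on blank-line gaps and drop empty pieces
    return [p for p in re.split(r'\s*\n\s*\n\s*', text) if p]


sample = "En un lugar de la Mancha\n\n\nde cuyo nombre\n \nno quiero acordarme"
paragraphs_nltk = split_paragraphs_nltk(sample)
print(len(paragraphs_nltk))  # 3
```

Unlike a plain split on `'\n\n'`, this also treats a line that contains only spaces as a blank line, so the counts could differ slightly from the ones above on a real file.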
This question is similar to the following ones, but none of them has a concrete answer:
How to split Text into paragraphs using NLTK nltk.tokenize.texttiling?
Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?