Segmenting Long text sequence into paragraphs using Python

Question

I'm trying to split a long text sequence into a suitable number of paragraphs. I found this SO question and thought of using 'nltk.tokenize.texttiling', but I get the following error when I run the code below in a notebook.

from nltk.tokenize.texttiling import TextTilingTokenizer
import nltk
nltk.download('stopwords')
tt = TextTilingTokenizer(demo_mode=False)
s, ss, d, b = tt.tokenize("Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at sentence gaps. The algorithm proceeds by detecting the peak differences between these scores and marking them as boundaries. Will not accept claims of zero casualties Mihin Lanka good plan but no funds Defence policy mixed with foreign policy was hoping to work with UNP not signing MCC a mistake. Former Parliamentarian, former Chief Executive Officer of somewhere and a contestant at the August 5 Parliamentary election from the National Democratic Front, someones name, has been mired in controversy. He is being investigated for alleged money laundering committed when he was part of the some presidents Government. Someones name, who is today fighting against the some president's administration, spoke to the Newspaper company online on some of the allegations against him.Well, that depends on the perspective that you look at it. If you take Wikipedia, it is an interactive database. A lot of people can go and write anything they want. I have seen what you are referring to and neither have I gone to correct it because everyone has the right to their own view. I think controversy can be defined in many ways. And your interpretation of controversy may differ from mine. I think what happened is, when you look at the past, some of the work that I have done and some of the involvement in terms of governance, that part of governance that I was involved in, and perhaps the effectiveness and perhaps the success I would have had in those spheres obviously made people jealous. And in politics the game is all about who gets ahead of the other. Once again its perception. If I ask you to tell me one thing I have done using thuggery. 
I maybe a little arrogant. But that’s my personal nature. I am a little hot-headed. But I have never done any harm to anyone. Absolutely not. When you look at the history, up to 2015, it was alright. And once we lost power in 2015, I was denied my nomination to contest the General Election. I was then incarcerated for seven months and I found out that it was basically a plot from within. Certain members of the family, very close to the President, didn’t want me back. They didn’t want to give me nomination for reasons which are obvious to them and not to me. Also later on, I found that certain actions taken in terms of keeping me imprisoned, certain meddling that they did with certain aspects of the judiciary was with the involvement of the former President as well as a Minister who was then a very powerful figure. No, I must say I don’t think former President person name or former Prime Minister person name had any hand in the matter. Of course, as soon as we lost power (in 2015), everybody was remanded. I was remanded then for seven months for the purported misuse of a vehicle. Seven years have gone and no charge sheet as yet. Publicly I can’t say this because I will be sued and you will be sued, but there was a certain intervention that was done to keep me inside for a longer period of time. The purpose of why that was done was to deny my nomination, and it happened. ")

Error:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/texttiling.py in _create_token_table(self, token_sequences, par_breaks)
    236             try:
--> 237                 current_par_break = next(pb_iter) #skip break at 0
    238             except StopIteration:

StopIteration: 

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
2 frames
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/texttiling.py in _create_token_table(self, token_sequences, par_breaks)
    238             except StopIteration:
    239                 raise ValueError(
--> 240                     "No paragraph breaks were found(text too short perhaps?)"
    241                     )
    242         for ts in token_sequences:

ValueError: No paragraph breaks were found(text too short perhaps?)

A workaround for this, or a suggestion of another working library, would be much appreciated.

The error message clearly says that your text might be too short. Is this the case? Can you provide an example of your actual text, or an example of nearly the same size? — cronoik, Jul 29 '20 at 13:36
@cronoik, I've tried a longer sequence, but it still gives me this error. I've edited the question to include the text that I've been passing. — Dilrukshi Perera, Jul 29 '20 at 14:50
The implementation requires at least two paragraphs, i.e. your text needs to contain `\n\n` (see the [code](https://www.nltk.org/_modules/nltk/tokenize/texttiling.html)). — cronoik, Jul 29 '20 at 19:48
@cronoik, thank you for pointing that out. What I wanted was to get this long text sequence divided into paragraphs. Perhaps I will have to use another method instead of `tokenize()`. — Dilrukshi Perera, Jul 29 '20 at 23:32
Yes, texttiling is not the right algorithm for paragraph segmentation. Have a look at the answers [here](https://stackoverflow.com/questions/3237624/how-to-use-nlp-to-separate-a-unstructured-text-content-into-distinct-paragraphs). — cronoik, Jul 30 '20 at 15:21
@cronoik Why isn't texttiling the right algorithm for paragraph segmentation? http://people.ischool.berkeley.edu/~hearst/research/tiling.html "TextTiling is a technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics." — Franck Dernoncourt, Aug 27 '20 at 20:21
@FranckDernoncourt Sorry, that was maybe a bit unclear. When I wrote texttiling above, I was referring to the `TextTilingTokenizer` and not to the term texttiling (which is a broad term and not a specific algorithm). — cronoik, Aug 28 '20 at 15:13
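
A possible workaround, following the comment that the text needs to contain `\n\n`: insert artificial paragraph breaks before calling `tokenize()`, so that `TextTilingTokenizer` has candidate boundaries to choose from. The sketch below is an assumption-laden illustration, not a tested fix for the exact text in the question. The helper names `add_pseudo_breaks` and `segment_with_texttiling`, the naive sentence splitter, and the default of three sentences per block are all hypothetical choices. As far as I can tell from the linked source, the tokenizer also ignores breaks that are less than about 100 characters apart, so very small blocks may be merged.

```python
import re


def add_pseudo_breaks(text, sentences_per_block=3):
    """Insert a blank line after every `sentences_per_block` sentences.

    Hypothetical helper: uses a naive split after '.', '!' or '?'
    followed by whitespace, so abbreviations will be split wrongly.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    blocks = [
        " ".join(sentences[i:i + sentences_per_block])
        for i in range(0, len(sentences), sentences_per_block)
    ]
    return "\n\n".join(blocks)


def segment_with_texttiling(text, sentences_per_block=3):
    """Run TextTilingTokenizer on text that has no paragraph breaks of its own."""
    # Imported lazily so add_pseudo_breaks() works without NLTK installed.
    # Assumes nltk.download('stopwords') has already been run once.
    from nltk.tokenize.texttiling import TextTilingTokenizer

    tt = TextTilingTokenizer()  # demo_mode defaults to False
    # With demo_mode=False, tokenize() returns a list of text segments.
    return tt.tokenize(add_pseudo_breaks(text, sentences_per_block))
```

Note that with `demo_mode=False`, `tokenize()` returns a single list of segments, so the four-way unpacking `s, ss, d, b = ...` from the question would only apply with `demo_mode=True`. Because the segment boundaries can only fall on the inserted breaks, the result depends on `sentences_per_block`, and the call may still fail on inputs that are genuinely too short.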
