
I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling. However, I am unable to get back a text tokenized by paragraph / topic change, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.

When I feed my text into texttiling, I get the same untokenized text back, but as a list, which is of no use to me.

    import nltk

    tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)

    tiles = tt.tokenize(text)  # same text returned

What I have are emails that follow this basic structure

    From: X
    To: Y                             (LOGISTICS)
    Date: 10/03/2017

    Hello team,                       (INTRO)

    Some text here representing
    the body                          (BODY)
    of the text.

    Regards,                          (OUTRO)
    X

    *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
    THIS EMAIL IS CONFIDENTIAL
    IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL

If we call this email string s, it would look like

    s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"

What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

*** Not all emails follow this same structure or have the same wording, so I can't use regular expressions.

Franck Dernoncourt
PyRsquared
  • I am also trying to get the relevant body paragraph part from the mail. Were you able to do the same from textTiling?? –  Oct 09 '18 at 18:57
  • Consider, alternatively, using `nltk.tokenize.BlanklineTokenizer` – axolotl Jan 07 '19 at 15:37

2 Answers


What about using splitlines? Or do you have to use the nltk package?

    email = """    From: X
        To: Y                             (LOGISTICS)
        Date: 10/03/2017

        Hello team,                       (INTRO)

        Some text here representing
        the body                          (BODY)
        of the text.

        Regards,                          (OUTRO)
        X

        *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
        THIS EMAIL IS CONFIDENTIAL
        IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

    # Strip leading/trailing whitespace from every line of the email
    y = [s.strip() for s in email.splitlines()]

    print(y)
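
If the sections happen to be separated by blank lines, grouping the lines into paragraphs is one way to get the blocks back as separate strings. The following is only a minimal sketch under that assumption: `split_paragraphs` is a hypothetical helper name, and picking the BODY by position only works for this particular layout.

```python
import re

def split_paragraphs(text):
    """Split text on runs of blank lines and strip each line (hypothetical helper)."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return ["\n".join(line.strip() for line in block.splitlines()) for block in blocks]

email = """From: X
To: Y
Date: 10/03/2017

Hello team,

Some text here representing
the body
of the text.

Regards,
X

*****DISCLAIMER*****
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

paragraphs = split_paragraphs(email)
body = paragraphs[2]  # assumption: BODY is the third block in this layout
print(body)
```

Emails without blank lines between sections would need a different way of finding the boundaries.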
MattR
  • Thanks @MattR, but with this method I need to manually delete all lines before the _m_'th line and all lines after the _n_'th line where _m_ and _n_ vary for every email. So this doesn't seem feasible. I don't have to use nltk, the main goal is just to delete everything but the BODY section – PyRsquared Apr 04 '17 at 06:55
  • @KillianTattan Not sure how you would programmatically identify the body section.. As you mentioned each email would be different. Some might even have more than one BODY section. My only thought would be to create some statistical model to identify the body sections... but that would take some time. Depends on the severity of the need. – MattR Apr 04 '17 at 12:41
  • I could split by lines and use naive Bayes to classify each line and remove it if needed, but that would take too much time. From what I understand, texttiling already does this in a much more sophisticated way. I just need to get the function working! Thanks for your help @MattR – PyRsquared Apr 05 '17 at 07:11

> What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

The texttiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:

> TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
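
To make the distinction concrete: TextTiling looks for points where the vocabulary shifts between adjacent stretches of text; it does not assign labels such as INTRO or BODY. The toy sketch below illustrates only that idea by scoring each gap between neighbouring paragraphs with cosine similarity over word counts. It is not NLTK's implementation (which compares blocks of fixed-size pseudo-sentences and works with smoothed depth scores), and `cosine_similarity` and `gap_scores` are hypothetical names.

```python
import math
import re
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca = Counter(re.findall(r"[a-z]+", a.lower()))
    cb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def gap_scores(paragraphs):
    """Similarity across each gap between neighbours; low values suggest a topic shift."""
    return [cosine_similarity(p, q) for p, q in zip(paragraphs, paragraphs[1:])]

paragraphs = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Stock prices fell sharply today.",
]
scores = gap_scores(paragraphs)
print(scores)
```

On very short blocks such as a greeting or a sign-off there are few words to compare, so scores like these are unlikely to separate INTRO from BODY reliably.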


References:

  • {1} Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text, Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
  • {2} Lee, J.Y. and Dernoncourt, F., 2016, June. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 515-520). https://www.aclweb.org/anthology/N16-1062.pdf
  • {3} Dernoncourt, Franck, Ji Young Lee, and Peter Szolovits. "Neural Networks for Joint Sentence Classification in Medical Paper Abstracts." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700. 2017. https://www.aclweb.org/anthology/E17-2110.pdf
  • {4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages, Computational Linguistics, 23 (1), pp. 33-64, March 1997.
  • {5} Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics, 28 (1), March 2002, pp. 19-36.
Franck Dernoncourt