
I was looking at ways to split documents into paragraphs and came across TextTiling as one option.

Here is my attempt to use it. However, I don't understand how to work with the output. I'd appreciate your help.

t = unidecode(doclist[0].decode('utf-8','ignore'))

nltk.tokenize.texttiling.TextTilingTokenizer(t)

output:

<nltk.tokenize.texttiling.TextTilingTokenizer at 0x11e9c6350>
Brian Burns
user3314418

1 Answer


I'm messing around with this one myself just now for the same reason you are, and I had the same question you did, so don't be too upset if this is wrong. I figured it was best to pass on what little I know... :)

I'm not sure yet, but I found an example of using the TextTilingTokenizer in this bug report:

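import nltk
# Needs the 'gutenberg' and 'stopwords' corpora, e.g. via nltk.download()
alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')
ttt = nltk.tokenize.TextTilingTokenizer()  # construct without any text
# The text goes to tokenize(), which returns a list of strings
tiles = ttt.tokenize(alice[140309:])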

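It appears that you want to feed your text to the tokenize method of the TextTilingTokenizer. In your snippet the text is passed to the constructor instead, which only builds the tokenizer object; that is why the output is just the object's repr rather than a list of segments.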
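
One thing to watch for when you try this on your own text: as far as I can tell from the NLTK docstring, the boundaries TextTiling finds are snapped to the nearest paragraph break (a blank line), so a text with no blank lines has nowhere to be split. Here is a rough sketch of how I'd guard against that. The helper names `count_paragraph_breaks` and `split_into_tiles` are my own, the regex is my assumption about what counts as a paragraph break, and the `ValueError` handling is based on what I believe some NLTK versions do with very short inputs, so treat all of it as a starting point rather than the official way.

```python
import re

# Assumed definition of a paragraph break: two newlines, optionally
# separated or surrounded by other whitespace (i.e. a blank line).
PARAGRAPH_BREAK = re.compile(r"[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*")


def count_paragraph_breaks(text):
    """Return how many blank-line paragraph breaks the text contains."""
    return len(PARAGRAPH_BREAK.findall(text))


def split_into_tiles(text):
    """Hypothetical helper: run TextTiling, falling back to [text]."""
    if count_paragraph_breaks(text) == 0:
        # No blank lines, so there is no place to put a boundary.
        return [text]

    # Imported lazily so the check above works even without NLTK installed.
    import nltk

    # The default tokenizer loads the English stopword list from this corpus.
    nltk.download("stopwords", quiet=True)

    tokenizer = nltk.tokenize.TextTilingTokenizer()  # no text here
    try:
        return tokenizer.tokenize(text)  # the text is passed here
    except ValueError:
        # Assumption: raised by some NLTK versions when the input is too short.
        return [text]
```

With that in place, `tiles = split_into_tiles(t)` should give you a list of strings, one per detected segment, or a one-element list when the text can't be split.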

unclejamil
  • I am having trouble getting this to return text tokenized by paragraph / topic change. I have `email = """ hi, **body of text** regards, X _disclaimer_"""; tt = TextTilingTokenizer(demo_mode=False); tiles = tt.tokenize(email)`. The problem is that this code returns a list of length 1, i.e. the same string `email`, just wrapped in a list. **What parameters / arguments do I have to change in `tt` to ensure this works?** I basically want to get rid of everything but the body of the email, but I can't use regular expressions because the structure and wording of every email changes. Thanks – PyRsquared Apr 05 '17 at 07:58