
Given a string:

c = 'A problem. She said: "I don\'t know about it."'

And an attempt to tokenize it:

>>> from nltk.tokenize import sent_tokenize
>>> for sindex, sentence in enumerate(sent_tokenize(c)):
...     print(str(sindex) + ": " + sentence)
...
0: A problem.
1: She said: "I don't know about it.
2: "
>>>

Why does NLTK split the closing quote of the second sentence off into a third sentence of its own? Is there a way to correct this behavior?

mix

2 Answers


Instead of the default sent_tokenize, use the boundary realignment feature that is already built into the pre-trained Punkt sentence tokenizer.

>>> import nltk
>>> st2 = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sent = 'A problem. She said: "I don\'t know about it."'
>>> st2.tokenize(sent, realign_boundaries=True)
['A problem.', 'She said: "I don\'t know about it."']

See section 6, "Punkt Tokenizer", of http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
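To make the idea of realignment concrete, here is a rough standard-library sketch of the kind of post-processing it amounts to: closing quotes or brackets that were split off at the start of a fragment are moved back onto the previous sentence. This is not NLTK's implementation; the function name `realign_closing_punct` and the set of closing characters are assumptions made for illustration only.

```python
import re

# Closing punctuation at the start of a fragment, followed by whitespace or
# the end of the fragment. The character set is an assumption for this sketch.
_LEADING_CLOSERS = re.compile(r'^["\')\]}]+(?=\s|$)')


def realign_closing_punct(sentences):
    """Move closing quotes/brackets that start a fragment onto the previous sentence."""
    realigned = []
    for fragment in sentences:
        # Only realign when there is a previous sentence to attach to.
        match = _LEADING_CLOSERS.match(fragment) if realigned else None
        if match:
            realigned[-1] += match.group()
            fragment = fragment[match.end():].lstrip()
        # Drop fragments that are empty once the punctuation has been moved.
        if fragment.strip():
            realigned.append(fragment)
    return realigned


# The split reported in the question (hypothetical variable name).
broken = ['A problem.', 'She said: "I don\'t know about it.', '"']
print(realign_closing_punct(broken))
```

Applied to the three fragments from the question, this returns two sentences with the closing quote reattached. An opening quote directly followed by a word, as in `"Hi there," he said.`, is left alone because the pattern requires whitespace or the end of the fragment after the punctuation.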

alvas
    You will run into full-stop tokenization problems with examples like `mr. john`, but there is a solution here: http://stackoverflow.com/questions/14095971/how-to-tweak-the-nltk-sentence-tokenizer – alvas Sep 22 '13 at 17:01

The default sentence tokenizer is PunktSentenceTokenizer, which starts a new sentence each time it finds a period, except when, for example, the period belongs to an acronym like U.S.A.

The NLTK documentation has examples of how to train a new sentence splitter on a different corpus. You can find them here.

So I guess that your problem can't be solved with the default sentence tokenizer, and you will have to train a new one and see whether it handles this case.

moliware