Why does NLTK mis-tokenize quote at end of sentence?

Question

Given a string:

c = 'A problem. She said: "I don\'t know about it."'

And an attempt to tokenize it:

>>> from nltk.tokenize import sent_tokenize
>>> for sindex, sentence in enumerate(sent_tokenize(c)):
...     print(str(sindex) + ": " + sentence)
...
0: A problem.
1: She said: "I don't know about it.
2: "
>>>

Why does NLTK put the end quote of sentence 2 into its own sentence 3? Is there a way to correct this behavior?

alvas · Accepted Answer · 2013-09-22T16:57:53.490

2

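Instead of the default sent_tokenize, use the boundary-realignment feature that is already built into the pre-trained Punkt sentence tokenizer. With realign_boundaries=True, closing punctuation such as a quote that falls just after a sentence break is attached to the sentence it closes: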

>>> import nltk
>>> st2 = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sent = 'A problem. She said: "I don\'t know about it."'
>>> st2.tokenize(sent, realign_boundaries=True)
['A problem.', 'She said: "I don\'t know about it."']

See section 6, "Punkt Tokenizer", of http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
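To see what the realignment step changes, the two settings can be compared side by side. This is a minimal sketch, not the answer's original code: it assumes an untrained `PunktSentenceTokenizer()` (so no model download is needed) instead of the pickled English model, and the variable names `raw` and `realigned` are illustrative.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Assumption: an untrained tokenizer is enough to show the effect.
# It has no learned abbreviations, unlike the pre-trained English model.
tokenizer = PunktSentenceTokenizer()

text = 'A problem. She said: "I don\'t know about it."'

# Same text, with and without the boundary-realignment pass.
raw = tokenizer.tokenize(text, realign_boundaries=False)
realigned = tokenizer.tokenize(text, realign_boundaries=True)

print(raw)
print(realigned)
```

With realignment off, the closing quote should be left behind as its own fragment; with it on, the quote should be attached back to the sentence it closes.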

edited Sep 22 '13 at 16:57

answered Sep 22 '13 at 16:52

1

You will still run into full-stop tokenization problems with examples such as `mr. john`, but there's a solution here: http://stackoverflow.com/questions/14095971/how-to-tweak-the-nltk-sentence-tokenizer – alvas Sep 22 '13 at 17:01
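One common way to handle abbreviations like `mr.` is to hand Punkt an explicit abbreviation list. The sketch below is only an illustration under stated assumptions: the abbreviation set and the sample sentence are made up for the example, and the tokenizer is built from bare `PunktParameters` instead of the pre-trained English model.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Assumption: an illustrative abbreviation list, stored lowercase
# and without the trailing period, as Punkt expects.
punkt_params = PunktParameters()
punkt_params.abbrev_types = {"mr", "mrs", "dr", "prof"}

# Build a tokenizer from these parameters instead of a trained model.
tokenizer = PunktSentenceTokenizer(punkt_params)

sample = "mr. john went home. He slept."
sentences = tokenizer.tokenize(sample)
print(sentences)
```

Because `mr` is listed as an abbreviation, the period after it should no longer be treated as a sentence boundary, while the period after `home` still is.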

score 1 · Answer 2 · answered Sep 22 '13 at 10:07

The default sentence tokenizer is PunktSentenceTokenizer, which starts a new sentence each time it finds a period, except when, for example, the period belongs to an acronym such as U.S.A.

The NLTK documentation has examples of how to train a new sentence splitter on a different corpus. You can find them here.

So I guess that your problem can't be solved with the default sentence tokenizer, and you will have to train a new one and try it.
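For reference, training a Punkt tokenizer on your own text can be sketched roughly as follows. This is an assumption-laden illustration, not the documentation's example: `training_text` is a tiny made-up stand-in for a real corpus, and a useful model would need far more text.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Assumption: a toy corpus; real training needs much more text.
training_text = (
    "The committee met on Monday. It approved the budget. "
    "Dr. Lee presented the report. The report was long. "
    "Members asked several questions. The meeting ended at noon."
)

# Learn abbreviations, collocations and sentence starters from the text.
trainer = PunktTrainer()
trainer.train(training_text)

# Build a tokenizer from the learned parameters.
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())

sentences = custom_tokenizer.tokenize(
    'A problem. She said: "I don\'t know about it."'
)
print(sentences)
```

Whether retraining helps with the trailing-quote case depends on the tokenizer's boundary handling and not only on the training data, so treat this as a starting point for experiments.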
