How to improve NLTK sentence segmentation?

Question

Given the paragraph from Wikipedia:

An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.

I run NLTK's nltk.sent_tokenize to split it into sentences. This returns:

['An ambitious campus expansion plan was proposed by Fr.', 
'Vernon F. Gallagher in 1952.', 
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', 
'It was during the tenure of Fr.', 
"Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
 ]

While NLTK could handle F. Henry J. McAnulty as one entity, it failed for Fr. Vernon F. Gallagher, and this broke the sentence into two.

The correct tokenization should be:

[
'An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', 
"It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
 ]

How can I improve the tokenizer performance?

This is probably something you should ask the devs directly... — Mad Physicist, Nov 13 '17 at 22:23
Could you share the full text that you want to sentence tokenize? — alvas, Nov 14 '17 at 04:36
@alvas the input text is the plain text from Wikipedia article https://en.wikipedia.org/wiki/Duquesne_University — Abdulrahman Bres, Nov 14 '17 at 05:23
@alvas the output for the small paragraph I posted was ['An ambitious campus expansion plan was proposed by Fr.', 'Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', 'It was during the tenure of Fr.', 'Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.'] — Abdulrahman Bres, Nov 14 '17 at 05:25

alvas · Accepted Answer · 2017-11-14T05:11:07.700

The beauty of the Kiss and Strunk (2006) Punkt algorithm is that it is unsupervised. So given a new text, you can retrain the model on that text and then apply it, e.g.

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
>>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."

# Training a new model with the text.
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>

# It automatically learns the abbreviations.
>>> tokenizer._params.abbrev_types
{'f', 'fr', 'j'}

# Use the customized tokenizer.
>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
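
The same retraining step can also be written with `PunktTrainer` directly. This may be more robust across NLTK versions: as far as I can tell, in some releases `PunktSentenceTokenizer.train()` returns the learned parameters without storing them on the tokenizer. A minimal sketch, assuming only the classes in `nltk.tokenize.punkt`; the helper name `train_punkt_on` is made up for illustration, and which abbreviations get learned depends on the training text:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to "
    "action."
)


def train_punkt_on(training_text):
    """Hypothetical helper: learn Punkt parameters from raw text and
    return a tokenizer that is explicitly built from those parameters."""
    trainer = PunktTrainer()
    # finalize=True computes abbreviations, collocations and sentence
    # starters once the text has been read.
    trainer.train(training_text, finalize=True)
    params = trainer.get_params()
    # Passing the parameters to the constructor avoids relying on any
    # side effect of PunktSentenceTokenizer.train().
    return PunktSentenceTokenizer(params)


tokenizer = train_punkt_on(text)

# _params is a private attribute; it is inspected here only for debugging.
learned_abbreviations = sorted(tokenizer._params.abbrev_types)
print(learned_abbreviations)

sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
```

Printing the learned abbreviations before tokenizing is a quick way to check whether the training text was large enough for Punkt to pick up forms such as `fr`.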

When there is not enough data to generate good statistics during retraining, you can also supply a predetermined list of abbreviations before training; see How to avoid NLTK's sentence tokenizer spliting on abbreviations?

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

>>> punkt_param = PunktParameters()
>>> abbreviation = ['f', 'fr', 'j']
>>> punkt_param.abbrev_types = set(abbreviation)

>>> tokenizer = PunktSentenceTokenizer(punkt_param)
>>> tokenizer.train(text)
<nltk.tokenize.punkt.PunktParameters object at 0x106c5d828>

>>> tokenizer.tokenize(text)
['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
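
If there is no suitable training text at all, the abbreviation list alone can be handed to the tokenizer, skipping the `train()` call. A sketch under that assumption; the abbreviation set below is chosen by hand for this paragraph (lowercase, without the trailing period) rather than learned:

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. "
    "Gallagher in 1952. Assumption Hall, the first student dormitory, was "
    "opened in 1954, and Rockwell Hall was dedicated in November 1958, "
    "housing the schools of business and law. It was during the tenure of "
    "F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to "
    "action."
)

# Hand-picked abbreviations (an assumption for this example).
punkt_param = PunktParameters()
punkt_param.abbrev_types = {"f", "fr", "j"}

# Build the tokenizer from the parameters only; nothing is trained here.
tokenizer = PunktSentenceTokenizer(punkt_param)

sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
```

Because nothing is learned in this variant, any abbreviation missing from the set can still trigger a split, so the list has to cover the titles that actually occur in the corpus.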

Can I pass a pre-determined list of abbreviations to the default Punkt model? (without re-training) — Abdulrahman Bres, Nov 14 '17 at 05:32
Always retrain, it doesn't hurt and it might be a better option than using a pre-trained model. — alvas, Nov 14 '17 at 05:33
Try to get away from the supervised learning mentality of train -> infer. And imagine "customized everything" in unsupervised-land =) — alvas, Nov 14 '17 at 05:34
Thank you :) while I am working on the entire Wikipedia, I am curious to know if it is sufficient to re-train per article, or to train a model beforehand on all articles? — Abdulrahman Bres, Nov 14 '17 at 05:36
Try both, it's not that expensive =) Please do share which get better results after re-training! — alvas, Nov 14 '17 at 05:37
