Given the paragraph from Wikipedia:
An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action.
I run NLTK's nltk.sent_tokenize to get the sentences. This returns:
['An ambitious campus expansion plan was proposed by Fr.',
'Vernon F. Gallagher in 1952.',
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
'It was during the tenure of Fr.',
"Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
]
NLTK handles initials such as the F. in Vernon F. Gallagher and the J. in Henry J. McAnulty, but it treats the period after Fr. as a sentence boundary, which breaks the first and last sentences in two.
The correct tokenization should be:
[
'An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.',
'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.',
"It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."
]
How can I improve the tokenizer so that it does not split sentences after abbreviations like Fr.?
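One direction that might be worth trying is to give the Punkt tokenizer an explicit abbreviation list. The sketch below is a minimal, untested-in-production example: it builds a PunktSentenceTokenizer from a hand-made PunktParameters object rather than from the pretrained English model that nltk.sent_tokenize loads, so it only knows the abbreviations listed here. The single entry "fr" is an assumption chosen for this paragraph, and the text uses Fr. before both names, as in the expected output.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = (
    "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. "
    "Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall "
    "was dedicated in November 1958, housing the schools of business and law. "
    "It was during the tenure of Fr. Henry J. McAnulty that Fr. Gallagher's ambitious "
    "plans were put to action."
)

# Punkt stores abbreviations lowercased and without the trailing period.
# "fr" is the only entry here; extend the set for other titles (assumption).
params = PunktParameters()
params.abbrev_types = {"fr"}

# Passing a PunktParameters object instead of raw training text makes the
# tokenizer use these parameters directly, with no training step.
tokenizer = PunktSentenceTokenizer(params)
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

With "fr" registered as an abbreviation, the period after Fr. should no longer be treated as a sentence boundary, while single-letter initials such as F. and J. are still covered by Punkt's built-in heuristic for initials. Because this tokenizer is not the pretrained English model, it may behave differently on other text, so it is worth checking against a larger sample.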