Looking at the source code for sent_tokenize()
reveals that this method currently uses the pre-trained punkt tokenizer, so it is the equivalent to PunktSentenceTokenizer
. Whether or not you will need to retrain your tokenizer depends on the nature of the text you are working with. If it is nothing too exotic, like newspaper articles, then you will likely find the pre-trained tokenizer to be sufficient. Tokenizing boils down to a categorization task, and thus different tokenizers could be compared by using the typical metrics such as precision, recall, f-score etc. on labelled data.
The punkt tokenizer is based on the work published in the following paper:
http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485#.V2ouLXUrLeQ
It is fundamentally a heuristic based approach geared to disambiguating sentence boundaries from abbreviations - the bane of sentence tokenization. Calling it a heuristic approach is not meant to be disparaging. I have used the built-in sentence tokenizer before and it worked fine for what I was doing, of course, my task did not really depend on accurate sentence tokenizing. Or rather, I was able to throw enough data at it where it did not really matter.
Here is an example of a question on SO where a user found the pre-trained tokenizer lacking, and needed to train a new one:
How to tweak the NLTK sentence tokenizer
The text in question was Moby Dick, and the odd sentence structure was tripping up the tokenizer. Some examples of where you might need to train your own tokenizer are social media (e.g. twitter) or technical literature with lots of strange abbreviations not encountered by the pre-trained tokenizer.