
I need to split text into sentences. I'm currently playing around with OpenNLP's sentence detector tool. I've also heard of the NLTK and Stanford CoreNLP tools. What are the most accurate English sentence detection tools out there? I don't need too many NLP features--only a good tool for sentence splitting/detection.

I've also heard about Lucene...but that may be too much. But if it has a kick-ass sentence detection module, then I'll use it.

samxli
    For Perl, [Lingua::EN::Sentence](http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm)? – Konerak Mar 14 '11 at 16:50

3 Answers


NLTK includes an implementation of the Punkt tokenizer described in Kiss and Strunk's paper on unsupervised multilingual sentence boundary detection. I don't know if it's the absolute best around, but it's very, very good, it's lightweight and easy to use, and it's free.
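
A minimal sketch of how the Punkt-based splitter is typically invoked (the sample text below is just an illustration; the `punkt` resource name is NLTK's pre-trained model data):

```python
import nltk

# One-time download of the pre-trained Punkt model data.
nltk.download('punkt')

text = ("Dr. Smith went to Washington. He arrived at approx. 3 p.m. "
        "and gave a short talk on sentence boundary detection.")

# sent_tokenize uses the pre-trained English Punkt tokenizer by default.
for sentence in nltk.sent_tokenize(text):
    print(sentence)
```

Punkt is unsupervised, so if the defaults stumble on a particular domain you can also train a `PunktSentenceTokenizer` on your own raw text.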

rmalouf

Check the LingPipe implementation: http://alias-i.com/lingpipe/docs/api/com/aliasi/sentences/IndoEuropeanSentenceModel.html

Their model is quite powerful and easy to implement: it checks a few pre/post rules (i.e. regexps) at each possible sentence split, and that's all. I found it works better than the ones in GATE and OpenNLP.
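
To illustrate the general idea, here is a rough Python sketch of that pre/post-rule heuristic (this is not LingPipe's actual Java API, and the abbreviation list is made up):

```python
import re

# Hypothetical "pre" rule data: abbreviations we never split after.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}

def split_sentences(text):
    """Rough pre/post-rule splitter -- an illustration only, not LingPipe."""
    sentences, start = [], 0
    # Candidate boundaries: runs of . ! ? followed by whitespace.
    for match in re.finditer(r"[.!?]+(?=\s)", text):
        end = match.end()
        tokens = text[start:end].split()
        prev_token = tokens[-1].lower().rstrip(".!?") if tokens else ""
        rest = text[end:].lstrip()
        next_char = rest[0] if rest else ""
        # Pre rule: don't split right after a known abbreviation.
        # Post rule: only split if the next chunk starts with a capital or digit.
        if prev_token not in ABBREVIATIONS and (next_char.isupper() or next_char.isdigit()):
            sentences.append(text[start:end].strip())
            start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived at 3 p.m. He left early. Amazing!"))
# -> ['Dr. Smith arrived at 3 p.m.', 'He left early.', 'Amazing!']
```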

There is another open source project that supports this heuristic model; see the example at http://code.google.com/p/graph-expression/wiki/SentenceSplitting

yura
  • Their licensing fee is quite hefty, and if I use the royalty-free license they require: "Data processed must be freely available". – samxli Mar 15 '11 at 01:22
  • Then you can check my project, graph-expression, which is currently GPL, but I'm thinking about changing it to LGPL if I find other committers. – yura Mar 15 '11 at 13:33
  • I just checked out your project. Will be testing it tomorrow :). I took a look at NLTK today and Lingua::EN::Sentence on CPAN. NLTK was okay, but it had some inaccuracies. Lingua::EN had a hard time recognizing ordered lists as a chunk. It allows for additional abbreviation definitions but couldn't recognize "1.", "2.", etc. – samxli Mar 15 '11 at 14:43

Perl is a text-processing language and an excellent, simple resource for text mining. It has absolutely no problem doing sentence splitting.

www.perl.org

Ralph Winters
    Are there certain sentence splitting models available for perl? For different domains, sentences may be defined differently. Also, it needs to be able to handle abbreviations and double spacing after periods, etc. – samxli Mar 15 '11 at 03:46
  • Perl is a text processing, pattern matching language. Abbreviations and spacing issues can be handled. – Ralph Winters Mar 15 '11 at 19:36
  • This answer is not of the quality of the others that mention NLTK, LingPipe, or other specific NLP tools. Sentence splitting is harder than just regex matching -- I don't recommend reinventing the wheel. – David J. Nov 13 '12 at 17:14
  • @DavidJames - David, Perl has been around since 1987 and has a WEALTH of source material for performing simple as well as complex sentence splitting. I factor that as a characteristic into what I would define as quality. – Ralph Winters Nov 14 '12 at 15:39
  • @RalphWinters I'm not saying Perl is low quality. Your answer doesn't go into any detail about what modules to use. – David J. Nov 15 '12 at 04:23