2

I would like to build a language model for a text corpus. Are there good out-of-the-box toolkits that would make this task easier? The only toolkit I know of is the Statistical Language Modelling (SLM) Toolkit by CMU.

Regards,

Stefanus
Dexter

3 Answers

2

NLTK is very powerful, though I've never used it.

Ned Batchelder
  • +1 The Natural Language Toolkit (NLTK) is your best choice. Download it from nltk.org or buy the book from the O'Reilly site. It is close to a must-have, IMO. – jim mcnamara Jul 21 '10 at 14:06
  • I have used NLTK in the past, but I never knew it could be used to build language models. – Dexter Jul 22 '10 at 12:10
  • http://nltk.googlecode.com/svn/trunk/doc/api/nltk.model.ngram.NgramModel-class.html I finally found the class, but there seems to be no documentation for it! – Dexter Jul 22 '10 at 19:12
  • 1
    I can safely say NLTK is not really that powerful after all. Reason: http://code.google.com/p/nltk/issues/detail?id=232 To be honest, it is very disappointing that something as "basic" as this model in machine learning is not only missing from NLTK but also available in very few toolkits for popular languages like Java/Python. – Dexter Jul 23 '10 at 18:03
1

The SRILM toolkit is very useful.

http://www.speech.sri.com/projects/srilm/

Aaron
0

KenLM is also worth trying. It's fast and uses good default settings. In contrast to SRILM, it offers fewer configuration options.
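For anyone unsure what these toolkits actually estimate, here is a minimal pure-Python sketch of a bigram language model with add-one (Laplace) smoothing. This is not how SRILM or KenLM work internally (they are far faster and typically use stronger smoothing such as Kneser-Ney); the function names `train_bigram`, `bigram_prob`, and `perplexity` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev).

    The vocabulary here is every token type seen in training,
    including the <s> and </s> padding symbols.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def perplexity(tokens, unigrams, bigrams):
    """Per-token perplexity of one tokenized sentence under the bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    pairs = list(zip(padded, padded[1:]))
    log_prob = sum(math.log(bigram_prob(p, w, unigrams, bigrams)) for p, w in pairs)
    return math.exp(-log_prob / len(pairs))


# Toy corpus, invented for this example.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
unigrams, bigrams = train_bigram(corpus)

print(bigram_prob("the", "cat", unigrams, bigrams))
print(perplexity(["the", "cat", "sat"], unigrams, bigrams))
print(perplexity(["sat", "cat", "the"], unigrams, bigrams))
```

A sentence whose word order was seen in training gets a lower perplexity than the same words scrambled, which is the basic behaviour you would check for with any of the toolkits mentioned here.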

Stefanus