For my current project I have to build a topic modeling or classification utility that will process thousands of articles and classify them into topics (perhaps 40-50 to start with). For example, it would go over articles on database technologies and classify each one as a NoSQL article, a relational DB article, or a graph database article (just an example).

I have a very basic NLP background, and our team mostly has Python backend scripting experience. I began looking into the options available and came across NLTK and scikit-learn, which are Python based, and Weka and Mallet, which are JVM based.
My understanding is that NLTK is better suited to learning and understanding NLP techniques such as topic classification.

Can someone suggest the best open source solution for our implementation? Please let me know if I have missed any information that would help with the answers.

whosthr
  • Do you have an existing training set of articles? If so, how big is it? Also, your example topics are very close together and are therefore much harder for an algorithm to classify correctly than if they were fishing, astronomy and 16th century painters. Those details largely determine which algorithm would be appropriate for your case. – Björn Lindqvist Apr 10 '13 at 08:05
  • Suggestions for "best" toolkits are off-topic. See the [FAQ](http://stackoverflow.com/faq). If you're looking for performance, I would avoid NLTK, which is mostly an educational toolkit, though it can be used for prototyping. – Fred Foo Apr 10 '13 at 14:04
  • Yes, we have a training set of articles available for some topics (20-50 articles). We have the option of starting with more varied topics and then moving towards more refined topics. – whosthr Apr 11 '13 at 19:03
  • Thanks for confirming my understanding of NLTK. I want to start off with a framework that can handle the more varied topics for now. Mallet looks like a probable option. Any experience or comments on that? I would also appreciate hearing about other options. – whosthr Apr 11 '13 at 19:12

1 Answer

A topic classification model can be built in two ways. If you have a training set with labels for the documents, you can build a classifier using scikit-learn.
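
As a rough illustration of the labeled route, here is a minimal sketch assuming a TF-IDF plus logistic regression pipeline. That particular combination is my assumption, not something the question prescribes, and the six toy articles and their labels are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set; real data would be the labeled articles.
train_texts = [
    "MongoDB stores schemaless JSON documents in collections",
    "Cassandra is a wide-column key-value store built for horizontal scaling",
    "PostgreSQL enforces foreign keys and supports SQL joins across tables",
    "MySQL tables use a fixed schema and are queried with SQL joins",
    "Neo4j traverses nodes and edges with the Cypher query language",
    "Graph databases model relationships as edges between nodes",
]
train_labels = ["nosql", "nosql", "relational", "relational", "graph", "graph"]

# Turn each article into TF-IDF features, then fit a linear classifier on them.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Classify an unseen article.
new_article = ["Shortest path queries over nodes and edges"]
predicted = model.predict(new_article)
print(predicted[0])
```

With only 20-50 articles per topic, a simple linear model like this is a reasonable first baseline; closely related topics will probably need more labeled data.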

But if you don't have any training data, you can build what is called a topic model. It basically gives you topics as groups of words.

You can use the Gensim package to implement this. It is crisp, fast, and easy to implement (Look Here).

Gyan Ranjan