14

I have a training set, and I want to use a classification method to classify other documents according to it. My documents are news articles, and the categories are sports, politics, economics, and so on.

I understand Naive Bayes and KNN completely, but SVMs and decision trees are still vague to me. Can I implement these methods myself, or are there existing applications or libraries for them?

What is the best method for classifying documents in this way?

thanks!

mshzmkot

3 Answers

13
  • Naive Bayes

Though this is the simplest algorithm and it assumes that all features are independent, it works very well for real text classification tasks. I would definitely try this algorithm first.

  • KNN

KNN (k-nearest neighbors) is a classification method, so it is a valid candidate here. It should not be confused with k-means, which is a clustering method.

  • SVM

SVM comes in two variants: SVC (classification) and SVR (regression); for this task you want SVC. It sometimes works well, but in my experience it performs poorly on text classification because it depends heavily on good tokenizers (filters), and the dictionary of a real dataset always contains dirty tokens. In those cases the accuracy was really bad for me.

  • Random Forest (decision tree)

I have never tried this method for text classification. I think a decision tree needs a few key nodes to split on, and it is hard to find "a few key tokens" in text classification. Random forests also tend to work badly on high-dimensional, sparse data.

FYI

These points all come from my own experience. For your case, there is no better way to decide which method to use than to try each algorithm on your data and compare the results.
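
As a rough illustration of "try every algorithm", here is a minimal sketch that scores several classifiers with cross-validation. It assumes scikit-learn (not Mahout or Weka), and the twelve toy documents, the labels, and the classifier settings are all made up for the example, so the numbers it prints say nothing about real news data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: four short "news" snippets per category.
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "the champion defended her tennis title",
    "parliament passed the new election law",
    "the president met the opposition leader",
    "voters elected a new government",
    "the senate debated the foreign policy bill",
    "the central bank raised interest rates",
    "inflation and unemployment fell this quarter",
    "the stock market closed higher",
    "exports grew as the currency weakened",
]
labels = ["sports"] * 4 + ["politics"] * 4 + ["economy"] * 4

# Candidate classifiers; the hyperparameters are arbitrary starting points.
candidates = {
    "naive_bayes": MultinomialNB(),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "linear_svm": LinearSVC(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, clf in candidates.items():
    # Each pipeline turns raw text into TF-IDF vectors, then classifies.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    # 3-fold stratified cross-validation; the default score is accuracy.
    scores = cross_val_score(pipeline, docs, labels, cv=3)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.2f}")
```

On a real corpus you would replace `docs` and `labels` with your own training set and probably use more folds.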

Apache Mahout is a great tool for machine learning. It covers three families of algorithms: recommendation, clustering, and classification. You could try this library, but you will need to learn some Hadoop basics first.

Weka is another machine learning toolkit for experiments, and it integrates many algorithms.

Freya Ren
  • 2
    -1. SVMs are among the top techniques for text classification, as evidenced by the large number of publications on the topic. You should be using **SVC** for classification, not **SVR**. – Marc Claesen Jul 03 '13 at 11:07
  • 1
    In my experience using SVM for text classification, the accuracy has never been good. I think this depends on the text data you use. Also, thanks for pointing out the mistake. – Freya Ren Jul 05 '13 at 15:43
7

Linear SVMs are among the top algorithms for text classification problems (along with logistic regression). Decision trees suffer badly in such high-dimensional feature spaces.

The Pegasos algorithm is one of the simplest Linear SVM algorithms and is incredibly effective.
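
To give a feel for how little code Pegasos needs, here is a minimal sketch of its basic stochastic sub-gradient update for a binary linear SVM, in plain Python. The function names, the tiny four-word vocabulary, and the values of `lam` and `n_iters` are assumptions for illustration; the optional projection step and any bias term are left out.

```python
import random


def pegasos_train(X, y, lam=0.01, n_iters=2000, seed=0):
    """Train a binary linear SVM with the basic Pegasos update.

    X: list of dense feature vectors (lists of floats)
    y: list of labels, each +1 or -1
    lam: regularization strength (lambda)
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, n_iters + 1):
        i = rng.randrange(len(X))          # pick one training example at random
        eta = 1.0 / (lam * t)              # step size shrinks as 1 / (lambda * t)
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        scale = 1.0 - 1.0 / t              # equals 1 - eta * lam
        w = [scale * wj for wj in w]       # regularization shrinks the weights
        if margin < 1.0:                   # margin violated: move toward y_i * x_i
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def pegasos_predict(w, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Hypothetical word counts over the vocabulary ["goal", "match", "vote", "election"].
X = [[2, 1, 0, 0], [1, 2, 0, 0], [1, 1, 0, 0],
     [0, 0, 2, 1], [0, 0, 1, 2], [0, 0, 1, 1]]
y = [1, 1, 1, -1, -1, -1]                  # +1 = sports, -1 = politics

w = pegasos_train(X, y)
print(pegasos_predict(w, [1, 0, 0, 0]))    # a document that only mentions "goal"
print(pegasos_predict(w, [0, 0, 0, 1]))    # a document that only mentions "election"
```

This only separates two classes. For several news categories you would typically train one such model per class (one-vs-rest) and pick the class with the highest score, and for real text you would use sparse vectors instead of dense lists.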

EDIT: Multinomial Naive Bayes also works well on text data, though usually not as well as linear SVMs. kNN can work okay, but it is already a slow algorithm and never tops the accuracy charts on text problems.
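
Since the question asks about implementing things yourself, here is a minimal sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing in plain Python. The function names, the whitespace tokenizer, and the six toy documents are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: list of class names."""
    token_counts = defaultdict(Counter)    # class -> token -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
        vocab.update(tokens)
    class_doc_counts = Counter(labels)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "token_counts": token_counts,
        # log P(class), estimated from the share of training documents
        "log_prior": {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()},
        "total_tokens": {c: sum(token_counts[c].values()) for c in class_doc_counts},
    }


def predict_multinomial_nb(model, tokens):
    """Return the class with the highest log posterior for the given tokens."""
    alpha, vocab = model["alpha"], model["vocab"]
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        counts = model["token_counts"][label]
        denom = model["total_tokens"][label] + alpha * len(vocab)
        score = log_prior
        for tok in tokens:
            if tok in vocab:               # words never seen in training are skipped
                score += math.log((counts[tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, tokenized by lowercasing and splitting on spaces.
train_texts = [
    "the team won the match",
    "the striker scored a goal",
    "parliament passed the law",
    "the president won the election",
    "the bank raised interest rates",
    "inflation fell and the market rose",
]
train_labels = ["sports", "sports", "politics", "politics", "economy", "economy"]

nb_model = train_multinomial_nb([t.lower().split() for t in train_texts], train_labels)
print(predict_multinomial_nb(nb_model, "the striker won the match".split()))
print(predict_multinomial_nb(nb_model, "the bank raised rates".split()))
```

Working in log space avoids underflow when documents are long, and the smoothing term `alpha` keeps a word that is missing from one class from zeroing out that class entirely.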

Raff.Edward
  • What about KNN and Naive Bayes? – mshzmkot Jul 02 '13 at 05:37
  • Yes, I want to know which method is best for my problem. I have fewer than 10 predefined classes. – mshzmkot Jul 02 '13 at 05:47
  • It's not meant to be humiliating. It's meant to get the point across. A lot of people abuse/use Stack Overflow as a crutch. You need to take what you have and go beyond that. Use it as a tool to help you learn. – Raff.Edward Jul 02 '13 at 14:36
  • 1
    I want to give a thumbs up to the Pegasos algorithm. It is often overlooked, but it is really easy to implement and a very decent alternative to Linear SVM. – Pedrom Jul 02 '13 at 15:05
2

If you are familiar with Python, you may want to consider NLTK and scikit-learn. The former is dedicated to NLP, while the latter is a more comprehensive machine learning package (it also has a great inventory of text processing modules). Both are open source and have great community support on SO.
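
For a sense of what the scikit-learn route looks like, here is a minimal sketch that trains a TF-IDF plus linear SVM pipeline and labels two unseen documents. The nine training snippets and the choice of `LinearSVC` are assumptions for illustration; a real news classifier would need far more data and some tuning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training set: three short snippets per category.
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "parliament passed the new election law",
    "the president met the opposition leader",
    "voters elected a new government",
    "the central bank raised interest rates",
    "inflation and unemployment fell this quarter",
    "the stock market closed higher",
]
train_labels = ["sports"] * 3 + ["politics"] * 3 + ["economy"] * 3

# The pipeline tokenizes and weights the text, then fits a linear SVM
# (multi-class is handled one-vs-rest by LinearSVC).
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(train_docs, train_labels)

new_docs = [
    "the striker scored in the football match",
    "the central bank cut interest rates",
]
predicted = model.predict(new_docs)
for doc, label in zip(new_docs, predicted):
    print(f"{label}: {doc}")
```

Swapping `LinearSVC` for another scikit-learn classifier such as `MultinomialNB` only changes one line, which makes it easy to compare methods on the same features.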

Moses Xu