Best Feature Selection Algorithm For Document Classification

Question

I am working on a document classification project, using tf-idf and centroid algorithms. These algorithms need a dictionary. I have tried information gain for making the dictionary, but I am not satisfied with the results. Can you suggest a feature selection algorithm better than information gain?

score 2 · Answer 1 · answered Jan 03 '13 at 09:53

In my experience, there is no such thing as a best feature selection method. Algorithms that work well for one data set may perform very poorly on others, so it is mostly an experimental question. Try a few and see which works for your problem setting. George Forman has published several articles on the subject; they are worth reading when you have time.
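To make "try a few" concrete, here is a minimal sketch that scores every term in a tiny corpus with two common criteria, information gain and chi-square, and ranks the vocabulary under each. The corpus, the helper names (`contingency`, `rank_terms`) and the binary present/absent treatment of terms are all assumptions made for illustration; on real data you would compare the resulting dictionaries by held-out classification accuracy, not by the scores themselves.

```python
import math

# Hypothetical toy corpus, for illustration only.
DOCS = [
    ("the match ended with a late goal", "sport"),
    ("the team won the cup final", "sport"),
    ("the striker scored a goal", "sport"),
    ("the bank raised interest rates", "finance"),
    ("the market fell as rates rose", "finance"),
    ("the bank reported record profit", "finance"),
]

def contingency(term, docs, positive):
    """Count documents by term presence and class membership.

    Returns (a, b, c, d):
      a = term present, class positive    b = term present, other class
      c = term absent,  class positive    d = term absent,  other class
    """
    a = b = c = d = 0
    for text, label in docs:
        present = term in text.split()
        if present and label == positive:
            a += 1
        elif present:
            b += 1
        elif label == positive:
            c += 1
        else:
            d += 1
    return a, b, c, d

def entropy(counts):
    """Shannon entropy (in bits) of a list of counts; zero counts are skipped."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n)

def information_gain(a, b, c, d):
    """Class entropy minus class entropy conditioned on term presence."""
    n = a + b + c + d
    conditional = 0.0
    if a + b:
        conditional += (a + b) / n * entropy([a, b])
    if c + d:
        conditional += (c + d) / n * entropy([c, d])
    return entropy([a + c, b + d]) - conditional

def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 term/class table (0.0 if degenerate)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def rank_terms(docs, positive, score):
    """Return the vocabulary sorted from highest to lowest score."""
    vocab = sorted({w for text, _ in docs for w in text.split()})
    scored = [(score(*contingency(t, docs, positive)), t) for t in vocab]
    return [t for _, t in sorted(scored, key=lambda p: (-p[0], p[1]))]

if __name__ == "__main__":
    for name, fn in [("information gain", information_gain),
                     ("chi-square", chi_square)]:
        print(name, rank_terms(DOCS, "sport", fn)[:5])
```

Both criteria push a word like "the", which occurs in every document, to the bottom of the ranking. Swapping in a further scoring function only requires another function with the same `(a, b, c, d)` signature.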

score 1 · Answer 2 · answered Jan 04 '13 at 16:39

It's also worth pointing out that in many cases, feature selection isn't necessary. Just use all the words, with a classifier that's robust to large feature spaces (a linear SVM or L1-regularized logistic regression, for example). It's one fewer problem to solve, and it's a baseline you'd need to explicitly justify not using.
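As a rough illustration of the "use all the words" baseline, the sketch below trains an L1-regularized logistic regression on binary bag-of-words features built from the full vocabulary, with no selection step. Everything here is an assumption for illustration: the toy data, the function names, the hyperparameters (`lam`, `lr`, `epochs`), the binary features in place of tf-idf, and the proximal gradient (soft-thresholding) optimizer. In practice a library implementation would replace the hand-written training loop.

```python
import math

# Hypothetical toy training set: 1 = sport, 0 = finance.
TRAIN = [
    ("late goal wins the match", 1),
    ("team lifts the cup", 1),
    ("striker scores the goal", 1),
    ("bank raises the rates", 0),
    ("market falls as rates rise", 0),
    ("bank reports the profit", 0),
]

def build_vocab(docs):
    """Map every word in the corpus to a column index (no selection)."""
    words = sorted({w for text, _ in docs for w in text.split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocab):
    """Binary bag-of-words vector; words outside the vocabulary are ignored."""
    x = [0.0] * len(vocab)
    for w in text.split():
        if w in vocab:
            x[vocab[w]] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def train_l1_logreg(X, y, lam=0.05, lr=0.5, epochs=300):
    """Full-batch proximal gradient descent on the mean logistic loss
    plus lam * ||w||_1. The bias term is not penalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j, xj in enumerate(xi):
                if xj:
                    grad_w[j] += err * xj / n
        b -= lr * grad_b
        # Gradient step on the loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b

def predict(text, vocab, w, b):
    x = vectorize(text, vocab)
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

if __name__ == "__main__":
    vocab = build_vocab(TRAIN)
    X = [vectorize(text, vocab) for text, _ in TRAIN]
    y = [label for _, label in TRAIN]
    w, b = train_l1_logreg(X, y)
    kept = [word for word, i in vocab.items() if w[i] != 0.0]
    print("words with non-zero weight:", kept)
```

The L1 penalty tends to drive the weights of uninformative words to exactly zero, so the classifier performs a form of feature selection as a side effect of training. That is one reason this baseline can stand in for an explicit selection step.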

Ben Allison

7,244
1
15
24

I don't agree with that assessment. Dimensionality reduction is a technique for developing a more generalised model in machine learning, regardless of the robustness of the classifier. It also reduces the computational cost of running the model. If that is not the case, then perhaps I agree with your comment. – OAK Dec 21 '15 at 13:33