I am working on a document classification project using tf-idf and centroid algorithms. These algorithms need a dictionary of terms to work with. I tried information gain to build the dictionary, but I don't think the results are satisfactory. Can you suggest a feature selection algorithm that is better than information gain?
2 Answers
2
In my experience, there is no such thing as a best feature selection method. Algorithms that work well for one data set may perform very poorly on others, so it is mostly an experimental question. Try a few and see which works for your problem setting. George Forman has published several articles on the subject; they are worth reading when you have time.
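
To make "try a few" concrete, here is a minimal pure-Python sketch that scores each term with two different criteria and ranks the vocabulary under each. Chi-squared is my own pick as a second criterion to set beside the information gain you already tried; the function names and the toy corpus are hypothetical, and the sketch assumes two classes and whitespace tokenization.

```python
import math
from collections import Counter

def term_class_counts(docs, labels):
    """Count, per term, how many positive and negative documents contain it."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        for term in set(text.lower().split()):
            (df_pos if y == 1 else df_neg)[term] += 1
    return df_pos, df_neg, n_pos, n_neg

def chi2(a, b, c, d):
    """Chi-squared for a 2x2 table.
    a: term & pos, b: term & neg, c: no term & pos, d: no term & neg."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def entropy(counts):
    """Shannon entropy (bits) of a list of counts."""
    total = sum(counts)
    if not total:
        return 0.0
    return -sum(k / total * math.log2(k / total) for k in counts if k)

def info_gain(a, b, c, d):
    """Class entropy minus class entropy conditioned on term presence."""
    n = a + b + c + d
    h_class = entropy([a + c, b + d])
    h_cond = (a + b) / n * entropy([a, b]) + (c + d) / n * entropy([c, d])
    return h_class - h_cond

def rank_terms(docs, labels, score):
    """Return all terms sorted by descending score (ties broken alphabetically)."""
    df_pos, df_neg, n_pos, n_neg = term_class_counts(docs, labels)
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a, b = df_pos[term], df_neg[term]
        scores[term] = score(a, b, n_pos - a, n_neg - b)
    return sorted(scores, key=lambda t: (-scores[t], t))

# Toy corpus (made up for illustration): 1 = spam, 0 = work mail.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes for monday",
]
labels = [1, 1, 0, 0]

by_chi2 = rank_terms(docs, labels, chi2)
by_ig = rank_terms(docs, labels, info_gain)
print("chi2 top terms:", by_chi2[:6])
print("info gain top terms:", by_ig[:6])
```

On a corpus this small the two rankings agree, so the comparison only becomes informative on real data: keep the top k terms from each ranking as the dictionary, run your classifier with each, and compare held-out accuracy.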

mbatchkarov
1
It's also worth pointing out that in many cases, feature selection isn't necessary. Just use all the words, with a classifier that's robust to large feature spaces (a linear SVM or L1-regularized logistic regression, for example). It's one fewer problem to solve, and it's a baseline you'd need to explicitly justify not using.
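
As a rough illustration of the "use all the words" baseline, the sketch below trains an L1-regularized logistic regression on a binary bag-of-words built from the full vocabulary. It is a toy pure-Python version using proximal gradient descent, which is my choice for keeping the example dependency-free; in practice you would use a library implementation. The function names, hyperparameters, and corpus are all hypothetical.

```python
import math

def build_vocab(docs):
    """Map every distinct word to a column index; no feature selection."""
    vocab = {}
    for text in docs:
        for term in text.lower().split():
            vocab.setdefault(term, len(vocab))
    return vocab

def featurize(text, vocab):
    """Binary bag-of-words as a sorted list of active column indices."""
    return sorted({vocab[t] for t in text.lower().split() if t in vocab})

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink toward zero, clip at zero."""
    return math.copysign(max(abs(w) - t, 0.0), w)

def train_l1_logreg(rows, labels, n_features, lam=0.01, lr=0.5, epochs=200):
    """Full-batch proximal gradient descent; the bias is not penalized."""
    w = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for idx, y in zip(rows, labels):
            err = sigmoid(bias + sum(w[j] for j in idx)) - y
            grad_b += err / n
            for j in idx:
                grad_w[j] += err / n
        bias -= lr * grad_b
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, bias

def predict(text, vocab, w, bias):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    idx = featurize(text, vocab)
    return 1 if sigmoid(bias + sum(w[j] for j in idx)) >= 0.5 else 0

# Toy corpus (made up for illustration): 1 = spam, 0 = work mail.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes for monday",
]
labels = [1, 1, 0, 0]

vocab = build_vocab(docs)
rows = [featurize(d, vocab) for d in docs]
w, bias = train_l1_logreg(rows, labels, len(vocab))
print(predict("buy now", vocab, w, bias))
print(predict("monday meeting", vocab, w, bias))
```

The L1 penalty pushes the weights of uninformative words toward exactly zero, so the classifier ends up doing a form of feature selection as part of training.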

Ben Allison
I don't agree with that assessment. Dimensionality reduction is a technique for developing a more generalised model in machine learning, regardless of the robustness of the classifier. It also reduces the computational cost of running the model. If that is not the case, then perhaps I agree with your comment. – OAK Dec 21 '15 at 13:33