
I'm doing feature selection for training my machine learning (ML) models using correlation. First I trained each model (SVM, NN, RF) on all features and ran 10-fold cross-validation to obtain a mean accuracy score. Then I removed the features that have a zero correlation coefficient with the class (which implies there is no relationship between the feature and the class), trained each model (SVM, NN, RF) again on the remaining features, and ran 10-fold cross-validation to obtain a mean accuracy score.
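
Here's roughly what I'm doing (a sketch with placeholder random data standing in for my real dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# placeholder data standing in for my real dataset
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 8)), rng.integers(0, 2, 200)

# Pearson correlation of each feature with the class labels
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
X_selected = X[:, corr != 0]  # drop the zero-correlation features

for name, model in [("SVM", SVC()),
                    ("NN", MLPClassifier(max_iter=1000)),
                    ("RF", RandomForestClassifier())]:
    print(name,
          cross_val_score(model, X, y, cv=10).mean(),           # all features
          cross_val_score(model, X_selected, y, cv=10).mean())  # after removal
```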

Basically, my objective is to do feature selection based on the accuracy scores I get in the above two scenarios. But I'm not sure whether this is a good approach to feature selection.

I also want to do a grid search to identify the best model parameters, but I'm getting confused by GridSearchCV in the scikit-learn API. Since it also does cross-validation (3 folds by default), can I use the best_score_ value obtained from a grid search in the above two scenarios to determine which features are good for model training?

Please advise me on this confusion, or suggest a good reference to read.

Thanks in advance.

1 Answer

As page 51 of this thesis says:

In other words, a feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant.

The thesis goes on to say that not only should you remove the features that are uncorrelated with the target, you should also watch out for features that correlate heavily with each other. Also see this.

In other words, it is a good idea to look at the correlation of the features with the classes (targets) and remove the features that have little to no correlation.
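
One way to implement both filters, as a minimal sketch (the function name, the low/high thresholds, and the assumption that the data lives in a pandas DataFrame with a named target column are all illustrative):

```python
import numpy as np
import pandas as pd

def correlation_filter(df, target, low=0.05, high=0.9):
    """Keep features correlated with the target, then drop one of
    every pair of features that correlate heavily with each other."""
    # absolute Pearson correlation of each feature with the target
    target_corr = df.drop(columns=target).corrwith(df[target]).abs()
    keep = target_corr[target_corr > low].index.tolist()

    # upper triangle of the feature-feature correlation matrix
    feat_corr = df[keep].corr().abs()
    upper = feat_corr.where(np.triu(np.ones(feat_corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > high).any()]
    return [f for f in keep if f not in redundant]
```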

Basically, my objective is to do feature selection based on the accuracy scores I get in the above two scenarios. But I'm not sure whether this is a good approach to feature selection.

Yes, you can certainly run experiments with different feature sets and use the accuracy to select the features that perform best. It's really important that you only look at the test accuracy, i.e., the model's performance on unseen data.
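
To make the train/test distinction concrete, a small sketch (reusing the X_selected and y from the snippet in the question):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# a held-out split: the test set plays the role of "unseen data"
X_tr, X_te, y_tr, y_te = train_test_split(X_selected, y, test_size=0.2,
                                          random_state=0, stratify=y)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("train accuracy:", model.score(X_tr, y_tr))  # optimistic; don't select on this
print("test accuracy:", model.score(X_te, y_te))   # compare feature sets on this
```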

I also want to do a grid search to identify the best model parameters.

Grid search is for finding the best hyperparameters (e.g. the SVM's C). Model parameters (e.g. the learned weights) are fitted during training.
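
For example, with scikit-learn (a sketch; the SVM grid values are arbitrary, and X, y are as in the question's snippet):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# hyperparameters: set before training and chosen by the search
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X, y)  # model parameters (e.g. support vectors) are learned here
print(search.best_params_, search.best_score_)
```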

Since it also does cross-validation (3 folds by default), can I use the best_score_ value obtained from a grid search in the above two scenarios to determine which features are good for model training?

If the set of hyperparameters searched over is fixed, the best_score_ value will be affected only by the feature set, and so it can be used to compare the effectiveness of different feature sets.
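
Concretely, something along these lines (a sketch reusing X, X_selected, y, and the grid from the snippets above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # same grid for both runs

for label, X_subset in [("all features", X), ("selected features", X_selected)]:
    search = GridSearchCV(SVC(), param_grid, cv=10)
    search.fit(X_subset, y)
    # best_score_ is the mean cross-validated score of the best grid point
    print(label, search.best_score_)
```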
