
There is an example in the scikit-learn docs, SVM-Anova. I want to further do GridSearchCV over the hyper-parameters, i.e., C and gamma of the SVM, for every percentile of features used in the example, like this:

# imports assuming scikit-learn of that era, where GridSearchCV and
# StratifiedKFold live in grid_search / cross_validation
from sklearn import svm, feature_selection, preprocessing
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

transform = feature_selection.SelectPercentile(feature_selection.f_classif)
clf = Pipeline([('anova', transform),
                ('normal', preprocessing.StandardScaler()),
                ('svc', svm.SVC())])
parameters = {
    'svc__gamma': (1e-3, 1e-4),
    'svc__C': (1, 10, 100, 1000)
}

percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)
for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    search = GridSearchCV(clf, parameters,
                          cv=StratifiedKFold(y, 7, shuffle=True, random_state=5),
                          scoring='roc_auc', n_jobs=1)
    search.fit(X, y)

This works fine; by doing this I can tune the parameters of Anova and of the SVM simultaneously, and use that pair of parameters to build my final model.

However, I am confused about how it works. Does it first split the data and then run it through the pipeline? If so, how can I determine the features chosen by Anova if I want to gain further insight into those selected features?

Say I get the best CV score with a certain pair of parameters (a percentile for Anova and C/gamma for the SVM): how can I find out exactly which features were retained in that setting? Every parameter setting was tested under CV, and each CV run consists of folds with different training data and therefore a different feature set selected by Anova.

One way I can think of is to intersect the feature sets retained in each fold for that best-performing pair of parameters, but I don't know how to modify the code to do it; a rough sketch of the idea is below.
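This is only an illustrative sketch, assuming `X`, `y` and the fold setup from above, with `best_percentile` standing in for the percentile of the best-scoring setting:

import numpy as np

best_percentile = 10  # placeholder: the percentile from the best-scoring setting

masks = []
for train_idx, _ in StratifiedKFold(y, 7, shuffle=True, random_state=5):
    anova = feature_selection.SelectPercentile(feature_selection.f_classif,
                                               percentile=best_percentile)
    anova.fit(X[train_idx], y[train_idx])  # rank features on this training fold only
    masks.append(anova.get_support())      # boolean mask of the selected features

common = np.logical_and.reduce(masks)      # features that were kept in every fold
print(np.where(common)[0])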

Any suggestion, or doubt about the method, is appreciated and welcome.

Francis

1 Answer


You could get rid of the loop over the percentiles and just put the percentiles in the parameter grid. Then you can look at the selected features of `search.best_estimator_`, that is, `search.best_estimator_.named_steps['anova'].get_support()`.
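For example, something like this (a sketch reusing the pipeline, `X` and `y` from the question):

parameters = {
    'anova__percentile': (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100),
    'svc__gamma': (1e-3, 1e-4),
    'svc__C': (1, 10, 100, 1000)
}
search = GridSearchCV(clf, parameters,
                      cv=StratifiedKFold(y, 7, shuffle=True, random_state=5),
                      scoring='roc_auc', n_jobs=1)
search.fit(X, y)

# GridSearchCV refits the best parameter combination on the whole dataset,
# so this mask describes the features used by the final model
mask = search.best_estimator_.named_steps['anova'].get_support()
print(search.best_params_)
print(mask.nonzero()[0])  # indices of the retained features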

Andreas Mueller
  • Thanks! But how does it determine the final selected features? The best estimator comes with the best percentile, the one that maximizes the CV score, but that specific percentile chose different features in different fold partitions under CV; how can I determine the final feature subset? – Francis Jul 07 '15 at 07:50
  • It was retrained on the whole training set after finding the best parameters. – Andreas Mueller Jul 07 '15 at 13:31
  • So it means it refits with the best parameters on the whole dataset `X` in this case, while cross-validation was done on each fold partition (a partial dataset) for the candidate percentiles? – Francis Jul 07 '15 at 16:10
  • Thanks. But may I ask why the final feature subset is determined that way? Why not employ a voting scheme, for example accumulating the feature subsets across all fold partitions, each selected under the threshold of the best percentile? Then we could rank the features by their occurrences in the accumulated subsets. – Francis Jul 08 '15 at 01:36
  • That would be possible, but is not the standard way of using cross-validation. Instead it would be a bagging approach. – Andreas Mueller Jul 09 '15 at 16:25