I am trying to do feature selection as part of a scikit-learn pipeline in a multi-label scenario. My goal is to select the best K features, for some given K.

It might be simple, but I don't understand how to get the indices of the selected features in such a scenario.

In a regular scenario I could do something like this:

anova_filter = SelectKBest(f_classif, k=10)

anova_filter.fit_transform(data.X, data.Y)

anova_filter.get_support()
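For reference, here is a minimal, self-contained sketch of that single-label case. The synthetic data from make_classification and the variable names are assumptions for illustration only; they stand in for data.X and data.Y.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic single-label data (an assumption; stands in for data.X / data.Y).
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

anova_filter = SelectKBest(f_classif, k=10)
X_new = anova_filter.fit_transform(X, y)

# Boolean mask over the original columns: True where a feature was kept.
mask = anova_filter.get_support()

# Integer column indices of the kept features.
indices = anova_filter.get_support(indices=True)
print(indices)
```

Passing indices=True returns the column positions directly instead of a boolean mask.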

But in a multilabel scenario my labels have dimensions #samples x #unique_labels, so fit and fit_transform raise the following exception: ValueError: bad input shape

That makes sense, because it expects labels of dimension [#samples].

In the multilabel scenario, it makes sense to do something like this:

clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),('svm', LinearSVC())])

multiclf = OneVsRestClassifier(clf, n_jobs=-1)

multiclf.fit(data.X, data.Y)

But then the object I get is of type sklearn.multiclass.OneVsRestClassifier, which doesn't have a get_support function. How do I get the trained SelectKBest model when it's used inside a pipeline?

Delli22

1 Answer

The way you set it up, there will be one SelectKBest per class. Is that what you intended? You can get them via

multiclf.estimators_[i].named_steps['f_classif'].get_support()
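As a rough sketch of what that looks like end to end, the snippet below fits the question's pipeline inside OneVsRestClassifier and reads out one selection per label. The random data, the way the three label columns are built, and k=5 are assumptions made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic multilabel data (an assumption): 200 samples, 20 features,
# 3 binary labels that each depend on a few known columns.
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
Y = np.column_stack([
    X[:, 0] > 0,             # label 0 depends on feature 0
    X[:, 1] + X[:, 2] > 0,   # label 1 depends on features 1 and 2
    X[:, 3] > 0,             # label 2 depends on feature 3
]).astype(int)

clf = Pipeline([('f_classif', SelectKBest(f_classif, k=5)),
                ('svm', LinearSVC())])
multiclf = OneVsRestClassifier(clf)
multiclf.fit(X, Y)

# One fitted pipeline per label, so one SelectKBest (and one selection) each.
selected = [est.named_steps['f_classif'].get_support(indices=True)
            for est in multiclf.estimators_]
for label, idx in enumerate(selected):
    print(label, idx)
```

Each label gets its own list of indices, and the lists will generally differ because the F-scores are computed against a different label column each time.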

If you want one feature selection for all the OvR models, you can do

clf = Pipeline([('f_classif', SelectKBest(f_classif, k=10)),
                ('svm', OneVsRestClassifier(LinearSVC()))])

and get the single feature selection with

clf.named_steps['f_classif'].get_support()
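Note that f_classif itself expects a 1-D label vector, so with a 2-D indicator matrix this shared-selection pipeline can hit the same "bad input shape" error from the question. One possible workaround, sketched below, is to pass SelectKBest a score function that aggregates per-label F-scores. The helper mean_f_classif, the choice of the mean as the aggregate, and the synthetic data are all assumptions for illustration, not part of scikit-learn or of the original answer.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def mean_f_classif(X, Y):
    """Hypothetical helper: average the ANOVA F-score of each feature
    over all label columns of a multilabel indicator matrix Y."""
    per_label = [f_classif(X, Y[:, j])[0] for j in range(Y.shape[1])]
    return np.mean(per_label, axis=0)

# Synthetic multilabel data (an assumption), same construction idea as above.
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
Y = np.column_stack([
    X[:, 0] > 0,
    X[:, 1] + X[:, 2] > 0,
    X[:, 3] > 0,
]).astype(int)

shared_clf = Pipeline([('f_classif', SelectKBest(mean_f_classif, k=5)),
                       ('svm', OneVsRestClassifier(LinearSVC()))])
shared_clf.fit(X, Y)

# A single selection shared by all one-vs-rest models.
shared_indices = shared_clf.named_steps['f_classif'].get_support(indices=True)
print(shared_indices)
```

Taking the maximum instead of the mean would favour features that are strongly predictive for at least one label; which aggregate is appropriate depends on the problem.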
Andreas Mueller
  • Thanks Andreas. I actually tried it but I get the ValueError: bad input shape exception. It seems like it doesn't handle the multilabel scenario, even though I use OneVsRestClassifier. Any thoughts? – Delli22 Sep 14 '15 at 08:47
  • Which one did you try? The first one? Without the traceback I can't really say. – Andreas Mueller Sep 14 '15 at 13:23
  • OK, some debugging and representation changes and it worked. Thanks! – Delli22 Sep 15 '15 at 07:39
  • clf.name_steps['f_classif'].get_support() should read clf.named_steps['f_classif'].get_support(); there is a "d" missing at the end of "named". – Diego Sep 07 '16 at 14:44
  • Hi, the indices returned via get_support(indices=True) and those obtained from the p-values (after selecting the smallest 'k' p-values) don't seem to be the same. Can you please shed some light on this? – rj dj Mar 20 '18 at 13:09
  • It uses the f scores but that shouldn't really make a difference? – Andreas Mueller Mar 20 '18 at 16:42
  • Does SelectKBest in the pipeline execute only during the training phase? I mean, when I call pipe.predict(X_test), SelectKBest does not execute anymore. Did I get it right? – Moreno Jun 07 '20 at 20:40