
I'm trying to do feature selection with the chi-square method in scikit-learn (sklearn.feature_selection.SelectKBest). When I apply it to a multilabel problem, I get this warning:

UserWarning: Duplicate scores. Result may depend on feature ordering. There are probably duplicate features, or you used a classification score for a regression task. warn("Duplicate scores. Result may depend on feature ordering."

Why is it appearing, and how do I properly apply feature selection in this case?
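
For context, a minimal sketch of the kind of call involved (the data here is a made-up placeholder, not my actual dataset, and whether the warning fires will depend on the data and the scikit-learn version):

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MultiLabelBinarizer

# placeholder multilabel data: chi2 needs non-negative feature values
X = np.random.randint(0, 5, size=(100, 20))
y = [{0, 1}, {2}] * 50                                # each sample has a set of labels
Y = MultiLabelBinarizer().fit_transform(y)            # (100, 3) binary indicator matrix

X_new = SelectKBest(chi2, k=10).fit_transform(X, Y)   # SelectKBest on a multilabel indicator matrix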

lizarisk

2 Answers


The code warns you that arbitrary tie-breaking may need to be performed because some features have exactly the same score.

That said, feature selection does not actually work for multilabel out of the box; the best you can currently do is tie feature selection and a classifier together in a pipeline, then feed that to a multilabel meta-estimator. Example (untested):

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)

(I believe this warning is issued even when the tied features aren't actually the k-th and (k+1)-th. It can usually be ignored safely.)
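
Continuing the snippet above, training and prediction would look roughly like this (again untested; the imports, the MultiLabelBinarizer step and the X / y placeholders are assumptions about your setup):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

Y = MultiLabelBinarizer().fit_transform(y)   # (n_samples, n_labels) binary indicator matrix
multi_clf.fit(X, Y)                          # one chi2 + SVM pipeline is fit per label
Y_pred = multi_clf.predict(X)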

Fred Foo
  • The warning still appears in this case (when feature selection is in the pipeline) – lizarisk May 08 '13 at 08:09
  • @lizarisk: I wasn't suggesting that as a fix for the warning, but as a way to do feature selection in the multilabel case. I'm not completely sure if it's the only way to do this; neither I, nor any core dev, ever considered this combination, I think. – Fred Foo May 08 '13 at 08:18
  • What do you mean by "feature selection does not work for multilabel out of the box"? Since it doesn't crash, what is it doing? – Fred Jul 19 '13 at 20:22

I know the topic is a bit old, but the following works for me:

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('lasso', OneVsRestClassifier(LogisticRegression()))])
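
For completeness, the imports and a fit call for the snippet above (X and the list-of-label-sets y are placeholders; note that here chi2 scores the features once against all labels together, before the one-vs-rest step):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

Y = MultiLabelBinarizer().fit_transform(y)   # binary indicator matrix
clf.fit(X, Y)                                # chi2 selection happens once, on all labels jointly
Y_pred = clf.predict(X)
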
Stergios
  • Wouldn't this perform feature selection on the entire training dataset at once rather than separately for each base classifier? I would expect that compared to the accepted answer, you could get bad performance for some labels. – Brian D'Astous Jul 22 '14 at 16:47