
I'm looking to perform feature selection with a multi-label dataset using sklearn. I want to get the final set of features across labels, which I will then use in another machine learning package. I was planning to use the method I saw here, which selects relevant features for each label separately.

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.multiclass import OneVsRestClassifier

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)

I then plan to extract the indices of the included features, per label, using this:

selected_features = []
for i in multi_clf.estimators_:
    selected_features += list(i.named_steps["chi2"].get_support(indices=True))
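To make this concrete, I can summarise how often each feature was picked with something like the sketch below (this assumes multi_clf has already been fit to the training data):

from collections import Counter

# Count how many of the per-label classifiers selected each feature index
feature_counts = Counter(selected_features)

# Every unique feature that was selected for at least one label
all_selected = sorted(feature_counts)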

Now, my question is, how do I choose which selected features to include in my final model? I could use every unique feature (which would include features that were only relevant for one label), or I could do something to select features that were relevant for more labels.

My initial idea is to create a histogram of the number of labels a given feature was selected for, and to identify a threshold based on visual inspection. My concern is that this method is subjective. Is there a more principled way of performing feature selection for multilabel datasets using sklearn?

Taylor

2 Answers


According to the conclusions in this paper:

[...] rank features according to the average or the maximum Chi-squared score across all labels, led to most of the best classifiers while using less features.

Then, in order to select a good subset of features you just need to do (something like) this:

import numpy as np
from sklearn.feature_selection import chi2, SelectKBest

# Chi-squared scores of every feature, computed separately for each label
label_scores = []
for label in labels:
    selector = SelectKBest(chi2, k='all')
    selector.fit(X, Y[label])
    label_scores.append(selector.scores_)
label_scores = np.array(label_scores)

# MeanCS: keep features whose average chi-squared score across labels exceeds the threshold
selected_features = np.mean(label_scores, axis=0) > threshold
# MaxCS: alternatively, keep features whose maximum score across labels exceeds the threshold
selected_features = np.max(label_scores, axis=0) > threshold

Note: in the code above I'm assuming that X is the output of some text vectorizer (the vectorized version of the texts), that Y is a pandas dataframe with one column per label (so I can select the column Y[label]), and that labels is the collection of label names, i.e. the columns of Y. Also, threshold is a value that has to be fixed beforehand.
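For example, once you have the boolean mask you can reduce X to the chosen columns like this (works for dense arrays as well as the sparse matrices produced by text vectorizers):

import numpy as np

# Indices of the features that passed the threshold
keep = np.where(selected_features)[0]

# Keep only those columns of the document-term matrix
X_reduced = X[:, keep]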

mac2bua

http://scikit-learn.org/stable/modules/feature_selection.html

There is a multitude of options, but SelectKBest and recursive feature elimination (RFE) are two reasonably popular ones.

RFE works by repeatedly leaving uninformative features out of the model, retraining, and comparing the results, so that the features left at the end are the ones that enable the best prediction accuracy.
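For example, a minimal sketch of RFE with cross-validation (RFECV) might look like this, where X and y are placeholders for a feature matrix and a single target vector:

from sklearn.svm import SVC
from sklearn.feature_selection import RFECV

# The estimator must expose coef_ (or feature_importances_) so RFE can rank features
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

print("Optimal number of features:", rfecv.n_features_)
selected = rfecv.get_support(indices=True)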

What is best is highly dependent on your data and use case.

Aside from what can loosely be described as cross-validation approaches to feature selection, you can look at Bayesian model selection, which is a more theoretical approach and tends to favor simpler models over complex ones.

Chris
    From what I understand, the feature selection methods in sklearn are for binary classifiers. You can get the selected features for each label individually, but my question is how to determine a final set of features that work across all labels in a principled manner. – Taylor May 05 '16 at 16:11
  • I'm not sure I understand what you mean. For example, SelectKBest is model independent, and you can see an example of RFE which shows you how to get the final feature set in the docs. http://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#example-feature-selection-plot-rfe-with-cross-validation-py Additionally, many/most implement a transform method which will select the trained best features from the inputs. – Chris May 05 '16 at 16:18
  • But that final feature set is for one classifier, right? Since binary relevance methods break the multilabel classification problem down into a series of binary classifications, that final feature set corresponds to only one of my many labels. I'll have a feature set returned by the feature selection methods for each of my individual labels, but I want to combine the selected features to create a feature set that works well for all labels. – Taylor May 05 '16 at 17:20
  • The link I posted has a working example of an 8-class, 25-feature classification problem that the RFECV feature selection method works on. I understand that the one-vs-all classifier may not work as you would like, but there are tonnes of other methods which do exactly what you want, as I linked to. Feature selection does not have to be done with the same model that you use in the final form (and there is some validity to the idea that that approach can increase the ease with which models overfit). – Chris May 06 '16 at 07:14