
I have a dataset from which I want to build a classifier:

import sys
import pandas as pd

dataset = pd.read_csv(sys.argv[1], decimal=",", delimiter=";", encoding='cp1251')
X = dataset.loc[:, dataset.columns != 'class']
Y = dataset['class']

I want to select important features only, so I do:

from sklearn import svm
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

clf = svm.SVC(probability=True, gamma=0.017, C=5, coef0=0.00001, kernel='linear', class_weight='balanced')
model = SelectFromModel(clf, prefit=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=5)
y_pred = clf.fit(X_train, Y_train).predict(X_test)
X_new = model.transform(X)

So X_new has shape 3000x72, while X had shape 3000x130. I would like to get a list of the features that are and are not in X_new. How can I do that?

X was a DataFrame with a header, but X_new is just an array of feature values without any column names, so I can't merge them the way I would in pandas. Thank you for any help!

Polly
  • Could you provide an example, just a few lines, of how `X_new` and `X` would look and what the output for them would be? – Nickil Maveli Sep 28 '16 at 15:06

3 Answers


You might also want to take a look at scikit-learn's Feature Selection documentation. It describes techniques and tools for doing this more systematically.
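For instance, here is a minimal sketch using scikit-learn's RFE (recursive feature elimination) with a linear SVM; it assumes the same X and Y from the question, and the choice of 72 features is only illustrative:

from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# RFE needs an estimator that exposes coef_, which a linear-kernel SVC does.
selector = RFE(SVC(kernel='linear', class_weight='balanced'), n_features_to_select=72, step=5)
selector.fit(X, Y)

# support_ is a boolean mask aligned with X's columns.
print(X.columns[selector.support_].tolist())    # kept features
print(X.columns[~selector.support_].tolist())   # dropped features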

happyhuman

Try running this code:

import sys
import pandas as pd
import numpy as np

from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

dataset = pd.read_csv(sys.argv[1], decimal=",", delimiter=";", encoding='cp1251')
X = dataset.loc[:, dataset.columns != 'class'].values
Y = dataset['class'].values

# Keep the column names so the selected features can be reported by name.
feature_names = dataset.columns.tolist()
feature_names.remove('class')

clf = SVC(probability=True, gamma=0.017, C=5, coef0=0.00001, kernel='linear', class_weight='balanced')
model = SelectFromModel(clf, prefit=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=5)
y_pred = clf.fit(X_train, Y_train).predict(X_test)
X_new = model.transform(X)

# get_support() returns a boolean mask in the same order as the columns of X.
print(pd.DataFrame(np.c_[feature_names, model.get_support()],
                   columns=['feature_name', 'feature_selected']))

The 'feature_selected' column shows whether each feature was selected or not.
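If you want the actual lists of feature names that did and did not make it into X_new, you can filter feature_names with the same mask (a small sketch building on the code above):

mask = model.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]
dropped = [name for name, keep in zip(feature_names, mask) if not keep]
print(selected)
print(dropped)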

prashanth

clf.coef_ gives you the feature weights (available after calling fit()). Sort the features by weight and you will see which ones are not very useful.
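A minimal sketch of that approach, assuming a binary problem (so clf.coef_ has shape (1, n_features)), that clf has already been fitted, and that X is still the original DataFrame:

import numpy as np

# Pair each column name with the absolute weight the fitted linear SVM assigned to it.
ranked = sorted(zip(X.columns, np.abs(clf.coef_[0])), key=lambda pair: pair[1], reverse=True)
for name, weight in ranked:
    print(name, weight)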

sergzach
  • But if I'm not mistaken, it doesn't give me a list of feature names, just the ordered coefficients, which I already got from SelectFromModel – Polly Sep 28 '16 at 14:38
  • Your classifier knows nothing about the names in the initial DataFrame, so I'd recommend building it manually, something like `weights = pd.DataFrame({'features': df.columns, 'weights': clf.coef_})` – arsenyinfo Sep 28 '16 at 18:14
  • @arsenyinfo I don't think you need the feature names. The order is the same as in your data (X). – sergzach Sep 28 '16 at 19:19