
I have created a binary text classification model using scikit-learn's logistic regression. Now I want to find out which features the model actually uses. My code looks like this:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train, val, y_train, y_test = train_test_split(np.arange(data.shape[0]), lab, test_size=0.2, random_state=0)
X_train = data[train]
X_test = data[val]

#X_train, X_test, y_train, y_test = train_test_split(data, lab, test_size=0.2)
tfidf_vect = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
X_tfidf_train = tfidf_vect.fit_transform(X_train)
X_tfidf_test = tfidf_vect.transform(X_test)
clf_lr = LogisticRegression(penalty='l1', solver='liblinear')  # 'l1' requires the liblinear or saga solver
clf_lr.fit(X_tfidf_train, y_train)
feature_names = tfidf_vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(len(feature_names))
y_pred_lr = clf_lr.predict_proba(X_tfidf_test)[:, 1]

What would be the best approach to do this?

Y0gesh Gupta

1 Answer


You can use sklearn.feature_selection.

Here is a link showing how to use it: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE
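For example, RFE can wrap the same logistic regression and recursively eliminate the weakest features. A minimal sketch on synthetic data (the dataset and feature counts here are illustrative, not the asker's):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data: 100 samples, 10 features, only 3 of them informative
X, y = make_classification(n_samples=100, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Recursively drop the lowest-weight feature until 3 remain
selector = RFE(LogisticRegression(solver='liblinear'), n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)            # boolean mask of the selected features
print(selector.ranking_)            # rank 1 = selected, higher = eliminated earlier
X_selected = selector.transform(X)  # the filtered version of the dataset
print(X_selected.shape)
```

The same `selector.transform` can then be applied to the test matrix so train and test keep identical columns.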

M.achaibou
  • Will it give the same features used to build the logistic regression model? – Y0gesh Gupta Sep 18 '17 at 18:45
  • Well, usually it's used to find which features give the highest score, and then the others are eliminated. You can use this information to create filtered versions of your dataset and increase the accuracy of your models. – M.achaibou Sep 18 '17 at 18:55