Questions tagged [feature-selection]

In machine learning, this is the process of selecting a subset of the most relevant features for constructing your model.

Feature selection is an important step for removing irrelevant or redundant features from your data. For more details, see Wikipedia.

1533 questions
27 votes · 3 answers

How to use scikit-learn PCA for features reduction and know which features are discarded

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples. Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way: from…
gc5 · 9,468
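A common resolution, sketched here with made-up data (and noting that scikit-learn expects samples in rows, so the asker's m-features × n-samples matrix must be transposed first): PCA never discards individual features outright, but the component loadings show which original features dominate each retained component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)            # hypothetical data: 100 samples x 5 features

pca = PCA(n_components=2)       # keep the 2 highest-variance components
pca.fit(X)

# Each retained component is a linear combination of ALL original
# features; the absolute loadings say which features dominate it.
loadings = np.abs(pca.components_)            # shape (2, 5)
dominant_feature = loadings.argmax(axis=1)    # top feature per component
```

Features whose loadings are small across every retained component are the ones PCA has effectively "discarded".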
25 votes · 3 answers

Put customized functions in Sklearn pipeline

In my classification scheme, there are several steps, including: SMOTE (Synthetic Minority Over-sampling Technique), Fisher criteria for feature selection, standardization (Z-score normalisation), and SVC (Support Vector Classifier). The main parameters to…
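A minimal sketch of the custom-step pattern: any object with `fit`/`transform` can slot into a `Pipeline`. The selector below is a toy stand-in, not real Fisher scoring, and note that SMOTE resamples `y`, so it needs imblearn's `Pipeline` rather than scikit-learn's.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

class TopMeanDiffSelector(BaseEstimator, TransformerMixin):
    """Toy selector: keep the k features with the largest absolute
    difference of class means (a crude stand-in for a Fisher score)."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
        self.idx_ = np.argsort(diff)[::-1][: self.k]
        return self

    def transform(self, X):
        return X[:, self.idx_]

rng = np.random.RandomState(0)
X, y = rng.rand(60, 6), np.tile([0, 1], 30)   # hypothetical data

pipe = Pipeline([
    ("select", TopMeanDiffSelector(k=3)),   # custom step
    ("scale", StandardScaler()),            # Z-score normalisation
    ("clf", SVC()),
])
pipe.fit(X, y)
```

Because every step follows the same interface, the whole pipeline can then be tuned in one `GridSearchCV` call.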
24 votes · 2 answers

How to handle date variable in machine learning data pre-processing

I have a data-set that contains, among other variables, the time-stamp of the transaction in the format 26-09-2017 15:29:32. I need to find possible correlations and predictions of the sales (let's say in logistic regression). My questions are: How to…
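One standard answer, sketched with two made-up rows: parse the stamp with pandas, then decompose it into numeric parts a model can actually use.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["26-09-2017 15:29:32", "01-10-2017 09:05:11"]})
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d-%m-%Y %H:%M:%S")

# A model cannot use the raw stamp; expose its parts as columns.
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.dayofweek    # Monday == 0
df["month"] = df["timestamp"].dt.month
```

For strongly cyclical parts (hour of day, month), a sine/cosine encoding is often preferred so that 23:00 and 00:00 end up close together.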
22 votes · 4 answers

Difference between PCA (Principal Component Analysis) and Feature Selection

What is the difference between Principal Component Analysis (PCA) and Feature Selection in Machine Learning? Is PCA a means of feature selection?
AbhinavChoudhury · 1,167
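The contrast can be shown in a few lines (iris data, standard scikit-learn calls): feature selection keeps a subset of the original columns unchanged, while PCA builds new axes that mix all of them, so PCA is feature extraction rather than selection.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                 # (150, 4)

# Selection: the surviving columns are original features, untouched.
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Extraction: every output axis is a mix of all four inputs.
X_pca = PCA(n_components=2).fit_transform(X)
```

After selection each output column still equals one of the input columns; after PCA none of them do.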
20 votes · 3 answers

Plot feature importance with xgboost

When I plot the feature importance, I get this messy plot. I have more than 7000 variables. I understand the built-in function only selects the most important, although the final graph is unreadable. This is the complete code: import numpy as…
rnv86 · 790
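With thousands of variables the usual fix is to plot only the top-k scores; xgboost's `plot_importance` takes a `max_num_features` argument for exactly this. The top-k selection itself is library-agnostic, sketched below with a scikit-learn forest standing in for the booster:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Sort importances descending and keep only the k largest for plotting.
k = 10
order = np.argsort(model.feature_importances_)[::-1][:k]
top_scores = model.feature_importances_[order]
```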
20 votes · 6 answers

Retain feature names after Scikit Feature Selection

After running a Variance Threshold from Scikit-Learn on a set of data, it removes a couple of features. I feel I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code: def…
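The usual answer is the selector's `get_support()` mask; a small sketch with a made-up DataFrame:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],   # zero variance -> dropped
    "noisy":    [0.1, 0.9, 0.4, 0.7],
    "trend":    [1.0, 2.0, 3.0, 4.0],
})

vt = VarianceThreshold()                # default: drop zero-variance features
X_reduced = vt.fit_transform(df)

# get_support() is a boolean mask over the ORIGINAL columns, so it can
# recover which names survived the thresholding.
kept = df.columns[vt.get_support()]
```

The same `get_support()` trick works for the other selectors (`SelectKBest`, `RFE`, …), since they share the transformer interface.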
18 votes · 4 answers

Difference between varImp (caret) and importance (randomForest) for Random Forest

I do not understand which is the difference between varImp function (caret package) and importance function (randomForest package) for a Random Forest model: I computed a simple RF classification model and when computing variable importance, I found…
Rafa OR · 339
17 votes · 1 answer

apache spark MLLib: how to build labeled points for string features?

I am trying to build a NaiveBayes classifier with Spark's MLLib which takes as input a set of documents. I'd like to put some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems…
17 votes · 4 answers

Recursive feature elimination on Random Forest using scikit-learn

I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process. However, when I try to use the RFECV method, I get an…
Bryan · 5,999
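A working configuration, sketched on synthetic data. RFECV requires an estimator that exposes `coef_` or `feature_importances_` after fitting, which a random forest does; true OOB scoring is not built in, so plain cross-validated ROC AUC is used here instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=120, n_features=12,
                           n_informative=4, random_state=0)

selector = RFECV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    step=1,                # drop one feature per elimination round
    cv=3,
    scoring="roc_auc",
)
selector.fit(X, y)
```

`selector.support_` then marks the retained features and `selector.ranking_` orders the eliminated ones.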
16 votes · 3 answers

sklearn logistic regression - important features

I'm pretty sure it's been asked before, but I'm unable to find an answer. Running Logistic Regression using sklearn on Python, I'm able to transform my dataset to its most important features using the Transform method classf =…
mel · 161
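The estimator-level `transform` the question refers to was deprecated in favour of `SelectFromModel`; a sketch on synthetic data, where the |coefficient| magnitudes act as the importance scores:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Keep the features whose |coef| exceeds the mean |coef|.
sfm = SelectFromModel(clf, threshold="mean", prefit=True)
X_important = sfm.transform(X)
```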
15 votes · 3 answers

Feature importances - Bagging, scikit-learn

For a project I am comparing a number of decision-tree ensembles, using the regression algorithms (Random Forest, Extra Trees, AdaBoost and Bagging) of scikit-learn. To compare and interpret them I use the feature importance, though for the bagging decision…
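Bagging meta-estimators expose no `feature_importances_` of their own, but each fitted base tree does, so averaging over `estimators_` gives a comparable score; a sketch on synthetic regression data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=150, n_features=8, random_state=0)

bag = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                       n_estimators=20, random_state=0).fit(X, y)

# Average the per-tree importances (valid here because max_features=1.0,
# so every tree was trained on all eight columns).
importances = np.mean([t.feature_importances_ for t in bag.estimators_],
                      axis=0)
```

If `max_features < 1.0`, the per-tree scores must first be scattered back to the original column indices via `bag.estimators_features_`.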
15 votes · 1 answer

find important features for classification

I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data I have is from a multichannel EEG setup so in essence I have a matrix of 63 x 116 x 50 (that is channels x time…
Mads Jensen · 663
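scikit-learn estimators want a 2-D (n_samples, n_features) array, so a channels × time × trials cube has to be reshaped trial-wise first; with an L1 penalty the classifier then doubles as an embedded feature selector. Random data below stands in for the EEG:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
data = rng.randn(63, 116, 50)          # channels x time x trials
labels = rng.randint(0, 2, 50)

# Move trials to axis 0, then flatten channel x time per trial.
X = data.transpose(2, 0, 1).reshape(50, 63 * 116)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, labels)
kept = np.flatnonzero(clf.coef_)       # features surviving the L1 penalty
```

The surviving flat indices can be unravelled back to (channel, time) pairs with `np.unravel_index(kept, (63, 116))`.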
14 votes · 1 answer

Perform Chi-2 feature selection on TF and TF*IDF vectors

I'm experimenting with Chi-2 feature selection for some text classification tasks. I understand that the Chi-2 test checks the dependence between two categorical variables, so if we perform Chi-2 feature selection for a binary text classification problem…
Moses Xu · 2,140
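scikit-learn's `chi2` only requires non-negative inputs, so both raw term counts and TF-IDF weights are acceptable; a toy four-document sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(docs)      # term counts (non-negative)
selector = SelectKBest(chi2, k=2).fit(X, labels)
scores = selector.scores_                      # one chi2 score per term
```

Swapping `CountVectorizer` for `TfidfVectorizer` changes the scores but not the mechanics.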
13 votes · 1 answer

Sklearn Chi2 For Feature Selection

I'm learning about chi2 for feature selection and came across code like this. However, my understanding of chi2 was that higher scores mean that the feature is more independent (and therefore less useful to the model), and so we would be interested in…
RSHAP · 2,337
13 votes · 2 answers

Python's implementation of Mutual Information

I am having some issues implementing the Mutual Information Function that Python's machine learning libraries provide, in particular : sklearn.metrics.mutual_info_score(labels_true, labels_pred,…
and_apo · 1,217
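The function's behaviour is easiest to pin down on tiny hand-built labelings: the MI of a labelling with itself equals its entropy (returned in nats), and a perfectly balanced independent labelling scores zero.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

a = [0, 0, 1, 1]
same = [0, 0, 1, 1]     # identical partition
indep = [0, 1, 0, 1]    # joint distribution = product of marginals

mi_same = mutual_info_score(a, same)     # entropy of a: ln 2
mi_indep = mutual_info_score(a, indep)   # exactly 0
```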