Questions tagged [feature-selection]

In machine learning, feature selection is the process of selecting a subset of the most relevant features to use when constructing a model.

Feature selection is an important step to remove irrelevant or redundant features from our data. For more details, see Wikipedia.

1533 questions
13
votes
2 answers

Recursive feature elimination and grid search using scikit-learn

I would like to perform recursive feature elimination with nested grid search and cross-validation for each feature subset using scikit-learn. From the RFECV documentation it sounds like this type of operation is supported using the estimator_params…
DavidS
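One way to combine the two steps from the question above is to wrap `RFECV` in `GridSearchCV` and tune the inner estimator through the nested parameter name `estimator__C`. This is only a sketch: the synthetic data, the choice of `LogisticRegression`, and the `C` grid are assumptions for illustration, not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration; not from the question).
X, y = make_classification(
    n_samples=150, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Inner loop: RFECV chooses how many features to keep by cross-validation.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3)

# Outer loop: the grid search tunes the wrapped estimator's C through the
# nested parameter name "estimator__C".
search = GridSearchCV(selector, param_grid={"estimator__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)

best_rfecv = search.best_estimator_  # RFECV refit on all data with the best C
print(search.best_params_)
print(best_rfecv.support_)           # boolean mask of the selected features
```

Note that this runs a full recursive elimination for every grid point and every outer fold, so it can get expensive on larger datasets.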
13
votes
1 answer

What to do first: Feature Selection or Model Parameters Setting?

This is more of a theoretical question. I'm working with the scikit-learn package to perform some NLP task. Sklearn provides many methods both for feature selection and for setting model parameters. I'm wondering what I should do first. If I…
feralvam
12
votes
5 answers

Get a feature importance from SHAP Values

I would like to get a dataframe of important features. With the code below I have got the shap_values, but I am not sure what the values mean. My df has 142 features and 67 experiments, but I got an array with ca. 2500 values. explainer =…
Parsyk
12
votes
3 answers

All intermediate steps should be transformers and implement fit and transform

I am implementing a pipeline that selects important features and then uses the same features to train my random forest classifier. My code follows. m = ExtraTreesClassifier(n_estimators=10) m.fit(train_cv_x, train_cv_y) sel =…
Stupid420
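The error in the title above typically appears when a plain classifier is placed in a non-final pipeline step. A possible fix, sketched here under assumptions (synthetic data, hypothetical step names `select` and `clf`), is to wrap the tree ensemble in `SelectFromModel` so that the step exposes `fit` and `transform`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the question's training data (an assumption).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

pipe = Pipeline([
    # SelectFromModel turns the classifier into a transformer that keeps
    # features whose importance is at least the mean importance (default).
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=0))),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(X, y)

mask = pipe.named_steps["select"].get_support()
print(mask.sum(), "of", mask.size, "features kept")
```

Because the selector lives inside the pipeline, it is refit on each training fold during cross-validation, which avoids leaking information from held-out data into the selection step.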
11
votes
2 answers

SciKit-Learn Label Encoder resulting in error 'argument must be a string or number'

I'm a bit confused - creating an ML model here. I'm at the step where I'm trying to take categorical features from a "large" dataframe (180 columns) and one-hot encode them so that I can find the correlation between the features and select the "best"…
11
votes
2 answers

Dealing with datasets with repeated multivalued features

We have a dataset in sparse representation with 25 features and 1 binary label. For example, a line of the dataset is: Label: 0 exid: 24924687 Features: 11:0 12:1 13:0 14:6 15:0 17:2 17:2 17:2 17:2 17:2 17:2 21:11 21:42 21:42 21:42 21:42…
Mo-
11
votes
3 answers

Best practice for holding huge lists of data in Java

I'm writing a small system in Java in which I extract n-gram features from text files and later need to perform a feature selection process in order to select the most discriminative features. The feature extraction process for a single file returns a…
11
votes
2 answers

Choosing Features to identify Twitter Questions as "Useful"

I collect a bunch of questions from Twitter's stream by using a regular expression to pick out any tweet containing text that starts with a question word (who, what, when, where, etc.) and ends with a question mark. As such, I end up getting…
bili
10
votes
2 answers

Logistic Regression: How to find the top three features that have the highest weights?

I am working on the UCI breast cancer dataset and trying to find the top 3 features that have the highest weights. I was able to find the weights of all features using logmodel.coef_, but how can I get the feature names? Below is my code, output and dataset…
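A minimal sketch of one approach: sort the absolute values of `logmodel.coef_` and index the feature names with the resulting positions. It assumes scikit-learn's bundled `load_breast_cancer` data stands in for the question's dataset and that features are standardized first so the weights are comparable; both are assumptions, not details from the question.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy stands in for the question's data.
data = load_breast_cancer()

# Standardize so that coefficient magnitudes are on a comparable scale.
X = StandardScaler().fit_transform(data.data)
logmodel = LogisticRegression(max_iter=1000).fit(X, data.target)

weights = logmodel.coef_[0]                    # one row for a binary problem
top3 = np.argsort(np.abs(weights))[::-1][:3]   # indices of the 3 largest |weights|

for i in top3:
    print(data.feature_names[i], weights[i])
```

With a pandas DataFrame, the same indices could be applied to `df.columns` instead of `data.feature_names`.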
10
votes
3 answers

Fast Information Gain computation

I need to compute Information Gain scores for >100k features in >10k documents for text classification. The code below works fine, but on the full dataset it is very slow: it takes more than an hour on a laptop. The dataset is 20newsgroup and I am using…
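A common way to speed this up is to compute all per-feature class counts with a single matrix product instead of looping over features. The sketch below assumes binary presence/absence features and a small dense array; the helper names `entropy` and `information_gain` are hypothetical, and for data the size described above a sparse matrix would probably be needed.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy in bits for each row of a matrix of class counts."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with a zero total (feature never or always present) get entropy 0.
    probs = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    logs = np.log2(probs, out=np.zeros_like(probs), where=probs > 0)
    return -(probs * logs).sum(axis=1)

def information_gain(X, y):
    """IG of each binary (present/absent) feature with respect to labels y."""
    X = (np.asarray(X) > 0).astype(float)              # (n_docs, n_features)
    y = np.asarray(y)
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float) # one-hot labels
    n_docs = X.shape[0]

    class_counts = Y.sum(axis=0)                       # docs per class
    present = X.T @ Y                                  # class counts where feature occurs
    absent = class_counts[None, :] - present           # class counts where it does not
    p_present = present.sum(axis=1) / n_docs

    h_class = entropy(class_counts[None, :])[0]
    h_cond = p_present * entropy(present) + (1 - p_present) * entropy(absent)
    return h_class - h_cond

# Tiny example: features 0 and 1 separate the classes, feature 2 does not.
X_demo = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y_demo = [0, 0, 1, 1]
scores = information_gain(X_demo, y_demo)
print(scores)
```

The cost is dominated by the `X.T @ Y` product, which scales with the number of non-zero entries when `X` is stored sparsely.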
9
votes
2 answers

Interpreting logistic regression feature coefficient values in sklearn

I have fit a logistic regression model to my data. Imagine I have four features: 1) which condition the participant received, 2) whether the participant had any prior knowledge/background about the phenomenon tested (binary response in…
9
votes
2 answers

Attribute's predictive capacity for a particular target in Python, using feature selection in Sklearn

Are there any feature selection methods in Scikit-Learn (or algos in general) that give weights of an attribute's ability/predictive-capacity/importance to predict a specific target? For example, the from sklearn.datasets import load_iris, ranking…
O.rka
9
votes
1 answer

What does get_fscore() of an xgboost ML model do?

Does anybody know how the numbers are calculated? In the documentation it says that this function "Get feature importance of each feature", but there is no explanation on how to interpret the results.
Peter Lenaers
9
votes
3 answers

Python feature selection in pipeline: how to determine feature names?

I used a pipeline and grid_search to select the best parameters and then used these parameters to fit the best pipeline ('best_pipe'). However, since the feature_selection (SelectKBest) is in the pipeline, there has been no fit applied to SelectKBest. I…
figgy
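One possible route to the selected names is to read the fitted `SelectKBest` step out of the refit pipeline and apply its boolean mask to the column names. This is a sketch under assumptions: the iris data, the step names `select` and `clf`, and the grid over `k` are illustrative, and `best_pipe` here is simply `grid_search.best_estimator_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

iris = load_iris()  # illustrative data, not from the question

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid_search = GridSearchCV(pipe, {"select__k": [1, 2, 3]}, cv=3)
grid_search.fit(iris.data, iris.target)

# With the default refit=True, best_estimator_ is already fitted on all
# the data, so its SelectKBest step has a valid support mask.
best_pipe = grid_search.best_estimator_
mask = best_pipe.named_steps["select"].get_support()
selected = np.array(iris.feature_names)[mask]
print(selected)
```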
9
votes
0 answers

Meaning of GridSearchCV with RFECV in sklearn

Based on Recursive feature elimination and grid search using scikit-learn, I know that RFECV can be combined with GridSearchCV to obtain better parameter settings for a model such as a linear SVM. As said in the answer, there are two ways: "Run…
Francis