Questions tagged [feature-selection]

In machine learning, this is the process of selecting a subset of the most relevant features for constructing your model.

Feature selection is an important step to remove irrelevant or redundant features from our data. For more details, see Wikipedia.

1533 questions
7
votes
1 answer

How does SelectKBest (chi2) calculate its scores?

I am trying to find the most valuable features by applying feature selection methods to my dataset. I'm using the SelectKBest function for now. I can generate the score values and sort them as I want, but I don't understand exactly how this score…
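A minimal sketch of what `SelectKBest(chi2)` computes (the iris data here is purely for illustration): the chi-squared statistic is calculated between each non-negative feature and the class labels, and the `k` highest-scoring features are kept.

```python
# SelectKBest(chi2) scores each feature by its chi-squared statistic
# against the target; higher score = stronger dependence on the labels.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)   # one chi2 statistic per original feature
print(X_new.shape)        # only the 2 best-scoring features remain
```

`scores_` holds the raw statistics, and `pvalues_` the corresponding p-values, so the ranking can be inspected directly.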
7
votes
1 answer

feature names from sklearn pipeline: not fitted error

I'm working with scikit-learn on a text classification experiment. Now I would like to get the names of the best-performing, selected features. I tried some of the answers to similar questions, but nothing works. The last lines of code are an…
Bambi
  • 715
  • 2
  • 8
  • 19
7
votes
1 answer

How is feature importance calculated for GradientBoostingClassifier?

I'm using scikit-learn's gradient-boosted trees classifier, GradientBoostingClassifier. It makes feature importance scores available in feature_importances_. How are these feature importances calculated? I'd like to understand what algorithm…
D.W.
  • 3,382
  • 7
  • 44
  • 110
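In scikit-learn's tree ensembles, `feature_importances_` is impurity-based: each split's impurity decrease (weighted by the fraction of samples reaching the node) is credited to the split feature, averaged over all trees, and normalized to sum to 1. A small sketch on synthetic data:

```python
# feature_importances_ = normalized mean decrease in impurity per feature,
# accumulated over every split in every tree of the ensemble.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=5,
                           n_informative=2, random_state=0)
gbc = GradientBoostingClassifier(random_state=0).fit(X, y)

imp = gbc.feature_importances_
print(imp)         # one value per feature
print(imp.sum())   # normalized, so the values add up to 1
```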
7
votes
2 answers

Multi-label feature selection using sklearn

I'm looking to perform feature selection with a multi-label dataset using sklearn. I want to get the final set of features across labels, which I will then use in another machine learning package. I was planning to use the method I saw here, which…
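One common approach to this (an assumption here, not necessarily the method the question links to): run a univariate selector per label column and take the union of the selected feature indices, giving a single feature set usable in any other package.

```python
# Per-label SelectKBest, then the union of selected indices across labels.
# Random data for illustration only; chi2 requires non-negative features.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
X = rng.rand(100, 10)
Y = rng.randint(0, 2, size=(100, 3))   # 3 binary labels

selected = set()
for j in range(Y.shape[1]):
    sel = SelectKBest(chi2, k=3).fit(X, Y[:, j])
    selected.update(np.flatnonzero(sel.get_support()))

print(sorted(selected))   # final feature indices shared across all labels
```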
7
votes
2 answers

How can I get the relative importance of features of a logistic regression for a particular prediction?

I am using a Logistic Regression (in scikit) for a binary classification problem, and am interested in being able to explain each individual prediction. To be more precise, I'm interested in predicting the probability of the positive class, and…
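Because logistic regression is linear in the log-odds, a single prediction decomposes exactly into per-feature contributions: log-odds = `intercept_` + Σ `coef_[i] * x[i]`. A sketch on synthetic data (note the contributions are only comparable across features if the inputs are on similar scales):

```python
# Per-feature contribution of one sample to the predicted log-odds,
# and a check that the sigmoid of their sum recovers predict_proba.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

x = X[0]
contrib = clf.coef_[0] * x                    # contribution per feature
logit = clf.intercept_[0] + contrib.sum()
prob = 1.0 / (1.0 + np.exp(-logit))           # sigmoid of the log-odds

print(contrib)
print(prob, clf.predict_proba([x])[0, 1])     # the two probabilities agree
```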
7
votes
3 answers

How does sklearn's random forest index feature_importances_?

I have used the RandomForestClassifier in sklearn to determine the important features in my dataset. How am I able to return the actual feature names (my variables are labeled x1, x2, x3, etc.) rather than their relative name (it tells me the…
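`feature_importances_` is ordered exactly like the training columns, so pairing it with the column names is enough. A minimal sketch using the iris data for illustration:

```python
# Zip the importance array with the original column names, then sort.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")   # feature name next to its importance
```

With a pandas DataFrame the same idea works via `zip(df.columns, rf.feature_importances_)`.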
7
votes
2 answers

Normalizing feature values for SVM

I've been playing with some SVM implementations and I am wondering: what is the best way to normalize feature values to fit into one range (from 0 to 1)? Let's suppose I have 3 features with values in the ranges 3 to 5, 0.02 to 0.05, and 10 to 15. How do I…
user3010273
  • 890
  • 5
  • 11
  • 18
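For mapping each feature to [0, 1] regardless of its original range, scikit-learn's `MinMaxScaler` is the standard tool; the key point is to fit it on the training data only and reuse the fitted scaler at test time. A sketch with the three ranges from the question:

```python
# MinMaxScaler maps each column's min to 0 and max to 1, independently.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[3.0, 0.02, 10.0],
                    [5.0, 0.05, 15.0],
                    [4.0, 0.03, 12.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # per-column 0s and 1s

# At prediction time, reuse the SAME fitted scaler:
# X_test_scaled = scaler.transform(X_test)
```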
6
votes
2 answers

sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable

I am attempting to use a pipeline to feed an ensemble voting classifier as I want the ensemble learner to use models that train on different feature sets. For this purpose, I followed the tutorial available at [1]. Following is the code that I…
6
votes
1 answer

Getting TypeError: '(slice(None, None, None), array([0, 1, 2, 3, 4]))' is an invalid key

Trying to use BorutaPy for feature selection, but getting a TypeError: '(slice(None, None, None), array([0, 1, 2, 3, 4]))' is an invalid key. from sklearn.ensemble import RandomForestClassifier from boruta import BorutaPy rf =…
6
votes
1 answer

Sentiment analysis Pipeline, problem getting the correct feature names when feature selection is used

In the following example I use a Twitter dataset to perform sentiment analysis. I use an sklearn pipeline to perform a sequence of transformations, add features, and add a classifier. The final step is to visualise the words that have the higher…
6
votes
2 answers

How to handle One-Hot Encoding in production environment when number of features in Training and Test are different?

While doing certain experiments, we usually train on 70% of the data and test on the remaining 30%. But what happens when your model is in production? The following may occur: Training Set: ----------------------- | Ser |Type Of Car | ----------------------- | 1 |…
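One standard way to keep the feature count aligned between training and production: fit the encoder once on the training data and let unseen categories encode as all-zero rows instead of raising. A sketch with hypothetical car makes (the column contents are illustrative):

```python
# OneHotEncoder(handle_unknown="ignore"): categories unseen at fit time
# become all-zero rows, so the column count never changes in production.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["BMW"], ["Audi"], ["Toyota"]])
enc = OneHotEncoder(handle_unknown="ignore").fit(train)

prod = np.array([["Audi"], ["Tesla"]])   # "Tesla" was never seen in training
encoded = enc.transform(prod).toarray()
print(encoded)   # the "Tesla" row is all zeros; still 3 columns
```

Persisting the fitted encoder (e.g. with `joblib`) alongside the model keeps the mapping identical at serving time.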
6
votes
2 answers

python spark: narrowing down most relevant features using PCA

I am using Spark 2.2 with Python. I am using PCA from the ml.feature module, and VectorAssembler to feed my features to PCA. To clarify, let's say I have a table with three columns, col1, col2 and col3; then I am doing: from pyspark.ml.feature…
6
votes
1 answer

Wrapper Methods for feature selection (Machine Learning) In Scikit Learn

I am trying to decide between scikit-learn and the Weka data mining tool for my machine learning project. However, I realized the need for feature selection. I would like to know if scikit-learn has wrapper methods for feature selection.
Sean Sog Miller
  • 207
  • 1
  • 4
  • 11
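scikit-learn does ship a wrapper-style selector: RFE (recursive feature elimination) repeatedly fits an estimator and drops the weakest features until the requested number remains. A minimal sketch on synthetic data:

```python
# RFE wraps any estimator exposing coef_ or feature_importances_ and
# eliminates the lowest-ranked features one step at a time.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

rfe = RFE(LogisticRegression(max_iter=500), n_features_to_select=3).fit(X, y)
print(rfe.support_)   # boolean mask of the 3 surviving features
print(rfe.ranking_)   # 1 = selected; higher = eliminated earlier
```

`RFECV` is the cross-validated variant that also chooses how many features to keep, and `SequentialFeatureSelector` offers forward/backward wrapper selection.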
6
votes
1 answer

How to efficiently retrieve the top K similar documents by cosine similarity using Python?

I am handling one hundred thousand (100,000) documents (mean document length is about 500 terms). For each document, I want to get the top k (e.g. k = 5) similar documents by cosine similarity. How can I do this efficiently in Python? Here is what I…
user1024
  • 982
  • 4
  • 13
  • 26
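One efficient pattern (a sketch on a tiny made-up corpus): since TF-IDF rows are L2-normalized by default, the dot product of the matrix with its transpose *is* the cosine similarity, and `np.argpartition` finds the k largest entries per row without a full sort.

```python
# Top-k neighbours by cosine similarity: sparse dot product of
# L2-normalized TF-IDF rows, then an O(n) partial partition per row.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat ran", "dogs bark loudly",
        "the dog ran", "cats and dogs"]
tfidf = TfidfVectorizer().fit_transform(docs)   # rows are unit-norm

k = 2
sims = (tfidf @ tfidf.T).toarray()   # cosine similarity matrix
np.fill_diagonal(sims, -1.0)         # exclude each document itself

# argpartition avoids sorting all n similarities; only the k largest matter
topk = np.argpartition(-sims, k, axis=1)[:, :k]
print(topk)   # indices of each document's k most similar documents
```

At the 100,000-document scale, processing the query rows in chunks (or using an approximate-nearest-neighbour index) keeps the similarity matrix from exhausting memory.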
6
votes
1 answer

How to change feature weights when training a model with sklearn?

I want to classify text using sklearn. First I used bag-of-words to train on the data; the bag-of-words feature space is really large, more than 10,000 features, so I reduced it to 100 with SVD. But here I want to add some other…
HAO CHEN
  • 1,209
  • 3
  • 18
  • 32