Questions tagged [feature-extraction]

In pattern recognition and image processing, feature extraction is a special form of dimensionality reduction: the input data is transformed into a set of features. If the extracted features are carefully chosen, the feature set is expected to capture the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input.

Feature extraction involves reducing the amount of resources required to describe a large set of data accurately. When analyzing complex data, one of the major problems stems from the number of variables involved. Analysis with a large number of variables generally requires a large amount of memory and computation power, or it can cause a classification algorithm to overfit the training sample and generalize poorly to new samples. Feature extraction is a general term for methods of constructing combinations of the variables to get around these problems while still describing the data with sufficient accuracy.
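As a minimal sketch of this idea, assuming a one-dimensional numeric signal, the example below reduces 120 raw values to five summary features. The function name, the choice of features, and the sample signal are illustrative assumptions and are not taken from the source.

```python
import math
import statistics


def extract_features(signal):
    """Reduce a raw 1-D signal to a small dictionary of summary features."""
    mean = statistics.fmean(signal)
    return {
        "mean": mean,
        "std": statistics.pstdev(signal),
        "min": min(signal),
        "max": max(signal),
        # Count sign changes of (value - mean) between neighbouring samples.
        "mean_crossings": sum(
            (a - mean) * (b - mean) < 0 for a, b in zip(signal, signal[1:])
        ),
    }


# Hypothetical input: a short periodic pattern repeated 20 times (120 values).
raw = [1.0, 2.0, 1.0, -1.0, -2.0, -1.0] * 20
features = extract_features(raw)
print(len(raw), "raw values ->", len(features), "features")
```

A classifier would then be trained on these few features instead of the raw values; which features are useful depends on the application.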

The best results are achieved when an expert constructs a set of application-dependent features. Nevertheless, if no such expert knowledge is available, general dimensionality reduction techniques may help.

Source: Wikipedia

1664 questions
16
votes
1 answer

Combining feature extraction classes in scikit-learn

I'm using sklearn.pipeline.Pipeline to chain feature extractors and a classifier. Is there a way to combine multiple feature extraction classes (for example the ones from sklearn.feature_extraction.text) in parallel and join their output? My code…
Daniel
15
votes
2 answers

Tensorflow feature column for variable list of values

From the TensorFlow docs it's clear how to use tf.feature_column.categorical_column_with_vocabulary_list to create a feature column which takes as input some string and outputs a one-hot vector. For example vocabulary_feature_column = …
14
votes
4 answers

How to deal with array of string features in traditional machine learning?

Problem: Let's say we have a dataframe that looks like this:

    age  job         friends                                   label
    23   'engineer'  ['World of Warcraft', 'Netflix', '9gag']  1
    35   'manager'   NULL …
14
votes
4 answers

TSFRESH library for python is taking way too long to process

I came across the TSfresh library as a way to featurize time series data. The documentation is great, and it seems like the perfect fit for the project I am working on. I wanted to implement the following code that was shared in the quick start…
Michael Bawol
14
votes
2 answers

CountVectorizer: "I" not showing up in vectorized text

I'm new to scikit-learn, and currently studying Naïve Bayes (Multinomial). Right now, I'm working on vectorizing text with sklearn.feature_extraction.text, and for some reason, when I vectorize some text, the word "I" doesn't show up in the…
covariance
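One likely explanation, assuming the vectorizer is used with its default settings, is that CountVectorizer's documented default token_pattern only matches tokens of two or more word characters. The sketch below applies that regular expression directly with the standard re module; the sample sentence and the variable names are illustrative assumptions.

```python
import re

# CountVectorizer's documented default token_pattern: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# An alternative pattern that also keeps one-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

text = "I think I like Python"
lowered = text.lower()  # the vectorizer lowercases by default

default_tokens = re.findall(DEFAULT_PATTERN, lowered)
all_tokens = re.findall(SINGLE_CHAR_PATTERN, lowered)

print(default_tokens)  # one-character tokens such as "i" are dropped
print(all_tokens)      # one-character tokens are kept
```

Under that assumption, passing token_pattern=r"(?u)\b\w+\b" to CountVectorizer should keep one-letter words as well.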
12
votes
1 answer

How to encode dependency path as a feature for classification?

I am trying to implement relation extraction between verb pairs. I want to use the dependency path from one verb to the other as a feature for my classifier (which predicts whether relation X exists or not). But I am not sure how to encode the dependency path as a…
12
votes
1 answer

Extract single line contours from Canny edges

I'd like to extract the contours of an image, expressed as a sequence of point coordinates. With Canny I'm able to produce a binary image that contains only the edges of the image. Then, I'm trying to use findContours to extract the contours. The…
Muffo
12
votes
1 answer

How to calculate Local Binary Pattern Histograms with OpenCV?

I have already seen that OpenCV provides a classifier based on LBP histograms, but I want to have access to the LBP histogram itself. For instance: histogram = calculate_LBP_Histogram( image ). Is there any function that performs this in OpenCV?
EijiAdachi
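For orientation, the basic 3x3 LBP operator can be sketched in plain Python. This is not an OpenCV function and says nothing about what OpenCV exposes; the function name lbp_histogram, the neighbour ordering, and the "greater than or equal" comparison are assumptions made for illustration.

```python
def lbp_histogram(image):
    """Return a 256-bin histogram of basic 3x3 LBP codes.

    `image` is a list of equally long rows of grayscale integers.
    Border pixels are skipped because they lack a full neighbourhood.
    """
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the centre.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist


# Hypothetical 4x4 test image with constant intensity.
flat = [[5] * 4 for _ in range(4)]
hist = lbp_histogram(flat)
print(sum(hist), "interior pixels,", hist[255], "with code 255")
```

The resulting histogram can be used directly as a feature vector; real implementations often add uniform-pattern binning or per-cell histograms, which this sketch omits.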
11
votes
2 answers

Is it possible to query Elastic Search with a feature vector?

I'd like to store an n-dimensional feature vector, e.g. <1.00, 0.34, 0.22, ..., 0>, with each document, and then provide another feature vector as a query, with the results sorted in order of cosine similarity. Is this possible with Elastic Search?
neptune
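Independently of any search engine, the ranking described in this question can be sketched in plain Python. The example makes no claim about Elasticsearch's API; the function names and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, documents):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical document vectors keyed by document id.
docs = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
ranking = rank_by_cosine([1.0, 0.0], docs)
print(ranking)
```

A brute-force scan like this is linear in the number of documents, so it only illustrates the scoring, not how a search engine would index the vectors.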
11
votes
3 answers

Best practice for holding huge lists of data in Java

I'm writing a small system in Java in which I extract n-gram features from text files and later need to perform a feature selection process in order to select the most discriminative features. The feature extraction process for a single file returns a…
11
votes
3 answers

extracting pitch features from audio file

I am trying to extract pitch features from an audio file, which I would use for a classification problem. I am using Python (scipy/numpy) for classification. I think I can get frequency features using scipy.fft but I don't know how to approximate…
Ada Xu
10
votes
2 answers

Feature Hashing on multiple categorical features(columns)

I would like to hash feature ‘Genre’ into 6 columns and separately feature ‘Publisher’ into another six columns. I want something like below: Genre Publisher 0 1 2 3 4 5 0 1 2 3 4 5 0 Platform …
Noor
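The layout asked for here, six hashed columns per categorical feature, can be sketched in plain Python by hashing each column separately and concatenating the results. This does not reproduce scikit-learn's FeatureHasher; the use of MD5, the helper names, and the sample publisher value are assumptions made for illustration.

```python
import hashlib


def hash_feature(value, n_buckets=6):
    """Map one categorical value to a one-hot vector of length n_buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_buckets
    vector = [0] * n_buckets
    vector[index] = 1
    return vector


def hash_row(row, columns, n_buckets=6):
    """Hash each listed column on its own and concatenate the blocks."""
    encoded = []
    for column in columns:
        encoded.extend(hash_feature(row[column], n_buckets))
    return encoded


# Hypothetical row; the publisher value is made up for the example.
row = {"Genre": "Platform", "Publisher": "ExamplePublisher"}
encoded = hash_row(row, ["Genre", "Publisher"])
print(encoded[:6], encoded[6:])
```

Because each column gets its own block of six buckets, collisions can only occur between values of the same feature, never between a genre and a publisher.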
10
votes
1 answer

Understanding the output of mfcc

    from librosa.feature import mfcc
    from librosa.core import load

    def extract_mfcc(sound):
        data, frame = load(sound)
        return mfcc(data, frame)

    mfcc = extract_mfcc("sound.wav")

I would like to get the MFCC of the following sound.wav file…
10
votes
2 answers

RandomForestRegressor and feature_importances_ error

I am struggling to pull out the feature importances from my RandomForestRegressor, I get an: AttributeError: 'GridSearchCV' object has no attribute 'feature_importances_'. Anyone know why there is no attribute? According to documentation there…
10
votes
1 answer

Empty vocabulary for single letter by CountVectorizer

Trying to convert string into numeric vector,

    ### Clean the string
    def names_to_words(names):
        print('a')
        words = re.sub("[^a-zA-Z]"," ",names).lower().split()
        print('b')
        return words

    ### Vectorization
    def Vectorizer(): …
LookIntoEast