Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
Questions tagged [feature-engineering]
481 questions
1 vote, 0 answers
How to feed key-value features (aggregated data) to LSTM?
I have the following time-series aggregated input for an LSTM-based model:
x(0): {y(0,0): {a(0,0), b(0,0)}, y(0,1): {a(0,1), b(0,1)}, ..., y(0,n): {a(0,n), b(0,n)}}
x(1): {y(1,0): {a(1,0), b(1,0)}, y(1,1): {a(1,1), b(1,1)}, ..., y(1,n): {a(1,n),…

Maximus
1 vote, 1 answer
Applying pandas qcut bins to new data results in NaN
I am binning for a modelling project and I ran into this problem.
This example acquires bins from a dataframe that does not contain 11, which results in a NaN when the bins are applied to a new dataframe that does contain 11. Obviously this will happen, but I wonder if there…
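One way to picture the behaviour described in this excerpt is the minimal sketch below. The data, the column name `x`, and the choice of four quantile bins are assumptions made for illustration, not the asker's code. It learns edges with `pd.qcut(..., retbins=True)`, shows that values outside the training range become NaN under `pd.cut`, and then opens the outermost edges to infinity as one possible workaround.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the training values stop at 10, the new data contains 11.
train = pd.DataFrame({"x": range(1, 11)})
new = pd.DataFrame({"x": [0, 5, 11]})

# Learn quantile edges on the training data only.
_, edges = pd.qcut(train["x"], q=4, retbins=True)

# Reusing the learned edges as-is: values outside [1, 10] fall in no bin.
naive = pd.cut(new["x"], bins=edges, include_lowest=True)

# Possible workaround: open the outer edges so unseen extremes land in
# the first or last bin instead of becoming NaN.
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
safe = pd.cut(new["x"], bins=open_edges, labels=False)

print(naive.isna().tolist())
print(safe.tolist())
```

Whether extreme values should be folded into the outer bins or flagged separately depends on the modelling goal; this sketch only shows the mechanics.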

noodle cold
1 vote, 0 answers
Event-driven approach to updating dependency files needed for calculating features in a production system
I have a production use case where my controller code depends on some external files (metadata for some relevant business logic; 3-5 JSON files which in total amount to about 1 GB of data) which get updated frequently to create…

here_to_learn
1 vote, 1 answer
Hash trick in sklearn FeatureHasher
Wanting to understand "the hashing trick", I've written the following test code:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
test = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e','f','g','h']})
h =…
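The excerpt's code is cut off, so the following is not a reconstruction of it. It is a separate, minimal sketch of the hashing trick with scikit-learn's `FeatureHasher`, assuming `n_features=4` and `input_type='string'` (both chosen here for illustration). Each category string is hashed to one of four columns and stored as +1 or -1, so with eight categories some of them must share a column.

```python
from sklearn.feature_extraction import FeatureHasher

# Same eight category labels as in the excerpt's test frame.
categories = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

# Assumption: a deliberately tiny output width to make collisions visible.
hasher = FeatureHasher(n_features=4, input_type='string')

# With input_type='string', each sample is an iterable of strings.
X = hasher.transform([[c] for c in categories])
dense = X.toarray()

print(dense.shape)
print(dense)
```

In practice `n_features` is set much larger (the library default is 2**20) so that collisions are rare.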

Roni Gadot
1 vote, 2 answers
Preserving order information in a single feature
The following is one column of a dataset that I'm trying to feature engineer:
+---+-----------------------------+
|Id |events_list |
+---+-----------------------------+
|1 |event1,event3,event2,event1 …

Shlomi Schwartz
1 vote, 0 answers
When creating a new similarity feature in a ham vs spam case, should I include the similarity of a spam text with itself in the average spam similarity?
I want to improve my model by adding a new feature column to my data, which consists of ham and spam texts.
I have already created the square cosine similarity matrix between all the texts; the diagonal entries of the matrix are 1s = cos(0).
I extract all the…
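As a minimal sketch of one option raised by this question, the code below averages each text's cosine similarity to the spam texts while leaving out the self-similarity on the diagonal. The 3x3 matrix, the `is_spam` mask, and all variable names are hypothetical, and this is not presented as the recommended answer, only as the mechanics of excluding the diagonal.

```python
import numpy as np

# Hypothetical cosine similarity matrix for three texts (diagonal is 1).
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.0]])
# Hypothetical labels: texts 0 and 1 are spam, text 2 is ham.
is_spam = np.array([True, True, False])

# Sum of similarities to all spam texts, per row.
spam_sums = sim[:, is_spam].sum(axis=1)
# For spam rows, remove the text's similarity with itself.
spam_sums = spam_sums - np.where(is_spam, np.diag(sim), 0.0)
# Number of spam texts each row was compared against (self excluded).
counts = is_spam.sum() - is_spam.astype(int)

mean_spam_sim = spam_sums / counts
print(mean_spam_sim)
```

This assumes at least two spam texts exist; with a single spam text its own count would be zero and the division would need guarding.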

yshi50
1 vote, 1 answer
Featuretools: Using features calculated in train data on new data
I was wondering how to use features built at training time for prediction on new data. The dataset in question is the appointment cancellation dataset from Predict appointment no show, GitHub.
Consider the feature locations.PERCENT_TRUE(no_show):…

Arun
1 vote, 1 answer
Handling a missing value in machine learning
I was analyzing a dataset with the following column names: [id, location, tweet, target_value]. I want to handle the missing values in the location column for some rows. So I thought to extract the location from the tweet column of that row (if tweet…

Deepak Chaudhary
1 vote, 1 answer
How do I convert topics for each item in the dataset into a feature vector, considering that each item can have more than 1 topic
I have a dataset which contains English statements. Each statement has been assigned a number of topics that the statement is about. The topics could be economy, sports, politics, business, science, etc. Each statement can have more than 1 topic.…
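A common way to represent the setup described here is a multi-hot vector with one column per topic. The sketch below uses scikit-learn's `MultiLabelBinarizer`; the three example statements and their topic lists are made up for illustration and do not come from the question.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical topic assignments, one list of topics per statement.
topics = [
    ["economy", "politics"],
    ["sports"],
    ["science", "business", "economy"],
]

mlb = MultiLabelBinarizer()
# One row per statement, one 0/1 column per distinct topic.
X = mlb.fit_transform(topics)

print(list(mlb.classes_))
print(X)
```

Columns follow the sorted order in `mlb.classes_`, and the fitted binarizer can be reused with `mlb.transform` on new statements whose topics were seen during fitting.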

Saad Farooq
1 vote, 1 answer
What is the proper way of using featuretools for single table data?
Assume that I have a dataset consisting of a single table; for instance, consider the Titanic dataset on Kaggle.
Now what is a proper way of using featuretools to get the most benefit from it, given that featuretools is designed for relational data?
Now by…

Graphics Engineer
1 vote, 0 answers
Broadcast error when using autofeat for automated feature engineering
When trying to use autofeat (https://github.com/cod3licious/autofeat) to automatically generate new features, I receive the following error:
operands could not be broadcast together with shapes (963,) (962,)
Simple code:
model =…

Graphics Engineer
1 vote, 1 answer
Compute all pairwise combinations of variables for a given operation in R
From a given dataframe:
# Create dataframe with 4 variables and 10 obs
set.seed(1)
df<-data.frame(replicate(4,sample(0:1,10,rep=TRUE)))
I would like to compute a subtraction between all pairwise combinations of columns, but only keeping one…

PeCaDe
1 vote, 1 answer
How to create new variables by multiple ids in featuretools?
I have a dataset that has one row per member and per transaction, and there are different stores the purchase could have come from ('brand_id'). I want to use featuretools to make output that would have one row per member, with an aggregate of…

Nate Thompson
1 vote, 1 answer
Is it a bad idea to use the cluster ID from K-means clustering of text data as a feature in your supervised learning model?
I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features, one is a string containing a few words about the purpose of the product (often abbreviations, name of the application…

kspr
1 vote, 0 answers
KeyError: 'Entity c does not exist in dfs'
When I try to run this code,
ftr_mtrx_custmr, features_defs = ft.dfs(entities=entities,
relationships=relationship,
target_entity="transactions")
I get this error:
490…

Ron