Questions tagged [feature-engineering]

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

481 questions
1 vote, 0 answers

How to feed key-value features (aggregated data) to LSTM?

I have the following time-series aggregated input for an LSTM-based model: x(0): {y(0,0): {a(0,0), b(0,0)}, y(0,1): {a(0,1), b(0,1)}, ..., y(0,n): {a(0,n), b(0,n)}} x(1): {y(1,0): {a(1,0), b(1,0)}, y(1,1): {a(1,1), b(1,1)}, ..., y(1,n): {a(1,n),…
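
One possible way to lay out such input is sketched below, assuming each time step is a dict keyed by y whose values hold the fields a and b. The key names, the zero fill for missing keys, and the helper `to_sequence` are all hypothetical. The sketch flattens every step into a fixed-width vector so the whole series has the (timesteps, features) shape an LSTM layer usually expects.

```python
import numpy as np

# Hypothetical input: one dict per time step, keyed by y, each holding a and b.
series = [
    {"y0": {"a": 1.0, "b": 2.0}, "y1": {"a": 3.0, "b": 4.0}},
    {"y0": {"a": 5.0, "b": 6.0}, "y1": {"a": 7.0, "b": 8.0}},
    {"y0": {"a": 9.0, "b": 10.0}},  # y1 is missing at this step
]
keys = sorted({k for step in series for k in step})  # fixed key order
fields = ("a", "b")


def to_sequence(series, keys, fields, fill=0.0):
    """Flatten each step into one row: [a(y0), b(y0), a(y1), b(y1), ...]."""
    out = np.full((len(series), len(keys) * len(fields)), fill, dtype=np.float32)
    for t, step in enumerate(series):
        for i, key in enumerate(keys):
            if key not in step:
                continue  # keep the fill value for missing keys
            for j, field in enumerate(fields):
                out[t, i * len(fields) + j] = step[key][field]
    return out


X = to_sequence(series, keys, fields)  # shape: (timesteps, features)
batch = X[np.newaxis, ...]             # shape: (1, timesteps, features)
```

Whether a zero fill is appropriate for missing keys depends on the data; a separate mask feature is another option.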
1 vote, 1 answer

Pandas qcut applied to new data results in NaN

I am binning for a modelling project and I ran into this problem. This example derives bins from a dataframe that does not contain 11, which results in a NaN when the bins are applied to a new dataframe that contains 11. Obviously this will happen, but I wonder if there…
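
A minimal sketch of the situation described, using made-up data (values 1 to 10 for fitting, then 0, 5 and 11 as new data). It reuses the edges returned by `pd.qcut(..., retbins=True)` and shows one common workaround, which is to open the outer edges to infinity before calling `pd.cut` on new data. This is an illustration under those assumptions, not necessarily what the asker ended up doing.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"x": range(1, 11)})  # made-up data, no 11
new = pd.DataFrame({"x": [0, 5, 11]})      # values outside the fitted range

# Learn quantile edges on the training data only.
_, edges = pd.qcut(train["x"], q=4, retbins=True)

# Reusing the edges directly leaves out-of-range values as NaN.
naive = pd.cut(new["x"], bins=edges, include_lowest=True, labels=False)

# Opening the outer edges assigns those values to the first or last bin.
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
fixed = pd.cut(new["x"], bins=open_edges, labels=False)
```

Here `naive` contains NaN for 0 and 11, while `fixed` assigns every value to a bin.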
1 vote, 0 answers

Event-driven approach to updating dependency files needed for calculating features in a production system

I have a production system use case where my controller code depends on some external files (metadata for some relevant business logic; 3-5 JSON files amounting to 1 GB of data in total) which get updated frequently to create…
1 vote, 1 answer

Hash trick in sklearn FeatureHasher

Wanting to understand "the hashing trick" I've written the following test code: import pandas as pd from sklearn.feature_extraction import FeatureHasher test = pd.DataFrame({'type': ['a', 'b', 'c', 'd', 'e','f','g','h']}) h =…
Roni Gadot
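
For readers unfamiliar with the hashing trick, here is a small sketch in the spirit of the excerpt, using the same eight category values. The choice of `n_features=4` and `input_type="string"` is an assumption made for illustration. Each category is hashed to one of four columns and stored as +1 or -1, so with more categories than columns some of them necessarily collide.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

types = ["a", "b", "c", "d", "e", "f", "g", "h"]

# With input_type="string", every sample is an iterable of string tokens.
hasher = FeatureHasher(n_features=4, input_type="string")
hashed = hasher.transform([[t] for t in types]).toarray()

# Each row has exactly one non-zero entry (+1 or -1) in the hashed column.
nonzero_per_row = np.count_nonzero(hashed, axis=1)
```

No vocabulary is stored, which is the point of the trick: the mapping from value to column is computed by the hash function alone.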
1 vote, 2 answers

Preserving order information in a single feature

The following is one column of a dataset that I'm trying to feature engineer:
+---+-----------------------------+
|Id |events_list                  |
+---+-----------------------------+
|1  |event1,event3,event2,event1 …
Shlomi Schwartz
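
One way to keep some order information, sketched here as an assumption rather than as the accepted answer, is to count consecutive event pairs (bigrams). The helper name `ordered_bigrams` is hypothetical, and the input string is the single row visible in the excerpt.

```python
from collections import Counter


def ordered_bigrams(events_list: str) -> Counter:
    """Count consecutive (previous, next) event pairs in a comma-separated list."""
    events = events_list.split(",")
    return Counter(zip(events, events[1:]))


pairs = ordered_bigrams("event1,event3,event2,event1")
```

Each distinct pair can then become its own column. Longer n-grams capture more of the order at the cost of more, sparser features.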
1 vote, 0 answers

When creating a new similarity feature in a ham vs. spam case, should I include the similarity of a spam text with itself in the average spam similarity?

I want to improve my model by adding a new feature column to my data, which consists of ham and spam texts. I have already created the square cosine similarity matrix between all the texts; the diagonal entries of the matrix are 1s = cos(0). I extract all the…
yshi50
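
The sketch below shows how the two variants of this feature could be computed, so they can be compared. It does not settle which one is better. The function name, the toy 3x3 matrix, and the labels are all made up for illustration.

```python
import numpy as np


def mean_similarity_to_class(sim, labels, target, exclude_self=True):
    """Mean similarity of every text to the texts of class `target`."""
    sim = np.asarray(sim, dtype=float)
    mask = np.asarray(labels) == target       # columns of the target class
    sums = sim[:, mask].sum(axis=1)
    counts = np.full(sim.shape[0], float(mask.sum()))
    if exclude_self:
        # A text of the target class drops its own diagonal entry.
        sums = sums - np.where(mask, np.diag(sim), 0.0)
        counts = counts - mask.astype(float)
    out = np.full_like(sums, np.nan)          # NaN when nothing is left to average
    np.divide(sums, counts, out=out, where=counts > 0)
    return out


# Toy data: two spam texts and one ham text.
sim = [[1.0, 0.5, 0.2],
       [0.5, 1.0, 0.4],
       [0.2, 0.4, 1.0]]
labels = ["spam", "spam", "ham"]
spam_feature = mean_similarity_to_class(sim, labels, "spam")
```

With `exclude_self=False` the spam rows would each average in their own 1.0, which inflates the feature for spam texts only.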
1 vote, 1 answer

Featuretools: Using features calculated in train data on new data

I was wondering how to use features developed at train time for prediction on new data. The dataset in question is the appointment cancellation dataset from Predict appointment no show (GitHub). Consider the feature locations.PERCENT_TRUE(no_show):…
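
The idea behind a feature like locations.PERCENT_TRUE(no_show) can be sketched in plain pandas. This is a stand-in, not the Featuretools API, and the toy data, the column names `location` and `no_show`, and the fallback to the global rate are assumptions. The rate is computed on training data only and then looked up for new rows.

```python
import pandas as pd

train = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "no_show": [True, False, True, True, False],
})
new = pd.DataFrame({"location": ["A", "B", "C"]})  # "C" was never seen in training

# Per-location no-show rate, learned from the training data only.
pct_no_show = train.groupby("location")["no_show"].mean()
global_rate = train["no_show"].mean()  # assumed fallback for unseen locations

new["location_pct_no_show"] = new["location"].map(pct_no_show).fillna(global_rate)
```

Computing the rate on training data alone avoids leaking labels from the rows being predicted.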
1 vote, 1 answer

Handling a missing value in machine learning

I was analyzing a dataset with the following columns: [id, location, tweet, target_value]. I want to handle the missing values in the location column for some rows, so I thought to extract the location from the tweet column of that row (if tweet…
1 vote, 1 answer

How do I convert topics for each item in the dataset into a feature vector, considering that each item can have more than 1 topic

I have a dataset which contains English statements. Each statement has been assigned a number of topics that the statement is about. The topics could be economy, sports, politics, business, science, etc. Each statement can have more than 1 topic.…
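
A common encoding for this kind of multi-topic data is a multi-hot vector with one column per topic. A minimal sketch with scikit-learn's `MultiLabelBinarizer` follows. The three example statements' topic lists are made up, using the topic names from the excerpt.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up topic assignments, one list per statement.
topics = [
    ["economy", "politics"],
    ["sports"],
    ["science", "business", "economy"],
]

mlb = MultiLabelBinarizer()
topic_matrix = mlb.fit_transform(topics)  # one row per statement, one column per topic
topic_names = list(mlb.classes_)          # column order (sorted topic names)
```

Each row has a 1 in every column whose topic applies to that statement, so statements with several topics simply have several 1s.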
1 vote, 1 answer

What is the proper way of using featuretools for single table data?

Assume that I have a dataset consisting of a single table; for instance, consider the Titanic dataset on Kaggle. What is a proper way of using featuretools to get the most benefit from it, given that featuretools is designed for relational data? Now by…
1 vote, 0 answers

Broadcast error when using autofeat for automated feature engineering

When trying to use autofeat (https://github.com/cod3licious/autofeat) to automatically generate new features, I receive the following error: operands could not be broadcast together with shapes (963,) (962,). Simple code: model =…
1 vote, 1 answer

Compute pairwise combinations of variables for a given operation in R

From a given dataframe: # Create dataframe with 4 variables and 10 obs set.seed(1) df<-data.frame(replicate(4,sample(0:1,10,rep=TRUE))) I would like to compute a subtraction over all pairwise combinations of columns, but only keeping one…
PeCaDe
1 vote, 1 answer

How to create new variables by multiple ids in featuretools?

I have a dataset that has one row per member and per transaction, and the purchase could have come from different stores ('brand_id'). I want to use featuretools to produce output that would have one row per member, with an aggregate of…
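
The target shape described here, one row per member with one aggregate column per brand, can be illustrated with a plain pandas pivot. This is a stand-in rather than Featuretools code, and the toy data and the column names `member_id` and `amount` are assumptions; only `brand_id` comes from the excerpt.

```python
import pandas as pd

# Toy transactions: one row per member and purchase.
tx = pd.DataFrame({
    "member_id": [1, 1, 1, 2, 2],
    "brand_id": ["x", "y", "x", "x", "x"],
    "amount": [10.0, 5.0, 2.0, 7.0, 1.0],
})

# One row per member, one summed-amount column per brand.
wide = tx.pivot_table(index="member_id", columns="brand_id",
                      values="amount", aggfunc="sum", fill_value=0)
wide.columns = [f"amount_sum_brand_{c}" for c in wide.columns]
wide = wide.reset_index()
```

Members with no purchases from a brand get 0 in that brand's column because of `fill_value=0`.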
1 vote, 1 answer

Is it a bad idea to use the cluster ID from clustering text data with K-means as a feature in your supervised learning model?

I am building a model that will predict the lead time of products flowing through a pipeline. I have a lot of different features, one is a string containing a few words about the purpose of the product (often abbreviations, name of the application…
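
For concreteness, here is a sketch of what such a feature could look like, assuming TF-IDF vectors and two clusters. The example strings, `k = 2`, and the one-hot step are illustrative assumptions. Since K-means cluster IDs are arbitrary labels with no ordering, the sketch one-hot encodes them rather than passing the raw integer to the model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up short purpose descriptions.
purposes = ["db backup tool", "db migration tool", "web ui app", "web api app"]

tfidf = TfidfVectorizer().fit_transform(purposes)
k = 2  # assumed number of clusters
cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tfidf)

# Cluster IDs are nominal, so encode them as one-hot columns.
cluster_onehot = np.eye(k)[cluster_ids]
```

In a real pipeline the vectorizer and the clustering would be fitted on training data only and then applied to new rows with `transform` and `predict`.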
1 vote, 0 answers

KeyError: 'Entity c does not exist in dfs'

When I try to run this code, ftr_mtrx_custmr, features_defs = ft.dfs(entities=entities, relationships=relationship, target_entity="transactions") I get the following error: 490…
Ron