Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work
Questions tagged [feature-engineering]
481 questions
1
vote
1 answer
Convert list of dicts with dicts as values to ML features
I want to transform the output of Google Vision API facial recognition into a feature set for an ML classifier. For each training instance I get a list of predicted faces, represented as a list of dictionaries where the values are themselves…

UlrikP
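One common way to approach this kind of problem is to flatten each nested dictionary into dotted keys and then aggregate across the variable-length list, so that every training instance yields one fixed-length row. The sketch below is a minimal illustration under that assumption; the field names (`confidence`, `bounds`) and the helper names are hypothetical and are not taken from the Vision API schema or from the question.

```python
from collections import defaultdict
from statistics import mean


def flatten(d, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def faces_to_features(faces):
    """Aggregate a variable-length list of face dicts into one feature row."""
    numeric = defaultdict(list)
    for face in faces:
        for key, value in flatten(face).items():
            # Keep only real numbers; bool is a subclass of int, so exclude it.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                numeric[key].append(value)
    features = {"n_faces": len(faces)}
    for key in sorted(numeric):
        features[f"{key}.mean"] = mean(numeric[key])
        features[f"{key}.max"] = max(numeric[key])
    return features


# Hypothetical input: two detected faces for one training instance.
faces = [
    {"confidence": 0.9, "bounds": {"width": 40, "height": 50}},
    {"confidence": 0.7, "bounds": {"width": 20, "height": 30}},
]
features = faces_to_features(faces)
```

Which aggregates to keep (mean, max, count, or something else) depends on the classifier and the data; the two used here are only placeholders.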
1
vote
0 answers
Using statistics to extract missing variables in a given dataset?
I would like to know which statistical approach is best suited to introducing new features into a given dataset.
Thanks!

Mohammad Saad
1
vote
0 answers
Featuretools taking too long to build features without using CPU cores
I'm using featuretools Deep Feature Synthesis to build features for a dataset of 40k rows and 200 columns. I chose about 40 transformation primitives, as you can see in the code below:
feature_matrix, feature_defs = ft.dfs(entityset=es,…

Alvaro Leandro Cavalcante
1
vote
3 answers
Pandas groupby apply function with an array of functions
I have a dataset like this (for example purposes):
df = pd.DataFrame({
'Store' : [100, 100, 100, 100, 101, 101, 101, 101],
'Product' : [5, 3, 10, 1, 3, 11, 2, 5],
'Category' : ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'A'],
'Sales' : [100, 235,…

Ricky
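On one reading of the title, the goal is to apply several aggregation functions per group in a single call. A minimal sketch using `GroupBy.agg` with a list of functions follows; the column names match the question, but the `Sales` values and the custom `sales_range` function are made up for illustration, since the original data is truncated.

```python
import pandas as pd

# Illustrative data using the question's column names; Sales values are invented.
df = pd.DataFrame({
    "Store": [100, 100, 101, 101],
    "Category": ["A", "B", "A", "A"],
    "Sales": [100, 200, 50, 150],
})


def sales_range(s):
    """Custom aggregation: spread between the largest and smallest sale."""
    return s.max() - s.min()


# A list mixes built-in aggregation names with custom callables;
# each one becomes a column named after the function.
summary = df.groupby("Store")["Sales"].agg(["sum", "mean", sales_range])
```

Here `summary` has one row per store and one column per aggregation.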
1
vote
0 answers
How are feature interactions calculated?
How are feature interactions calculated for a pandas DataFrame using Python? Are there any packages or libraries that calculate feature interactions?

Ailurophile
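In the simplest sense, a pairwise feature interaction is the product of two feature columns. The sketch below computes every pairwise product with the standard library only; the row format (a list of dicts) and the `a*b` naming scheme are assumptions made for illustration.

```python
from itertools import combinations


def add_pairwise_interactions(rows, columns):
    """Return new rows with a product feature for every pair of columns."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for a, b in combinations(columns, 2):
            new_row[f"{a}*{b}"] = row[a] * row[b]
        out.append(new_row)
    return out


rows = [{"x1": 2.0, "x2": 3.0, "x3": 0.5}]
result = add_pairwise_interactions(rows, ["x1", "x2", "x3"])
```

For a library route, scikit-learn's `PolynomialFeatures` with `interaction_only=True` produces the same kind of product terms, though whether that matches what the question means by "interactions" is not clear from the excerpt.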
1
vote
1 answer
How to get trans_primitives of highest entity in featuretools?
In the classic mock customer dataset example in featuretools, how do I derive trans_primitives such as month, day, and year from the transaction_time attribute of the transactions entity?
import featuretools as ft
es =…

Milind Dalvi
1
vote
0 answers
How to calculate the body size of candle for OHLC for comparison
I need help understanding how to calculate the candle body from OHLC data in Python, and I would like to make the following classifications from the OHLC data.
STRONG BUY: if next candle's body is outside last candle's body AND next candle body > 2x times…

InvestingBetter
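The body of a candle is conventionally the absolute difference between its open and close prices. The sketch below computes that, plus the ratio of one body to the previous one; the dict layout and the function names are hypothetical, and the classification rules themselves are not implemented because the excerpt cuts them off.

```python
def body_size(candle):
    """Absolute distance between open and close (the candle body)."""
    return abs(candle["close"] - candle["open"])


def body_ratio(prev, nxt):
    """Size of the next body relative to the previous one.

    Returns None when the previous body is zero (open equals close),
    since the ratio is undefined in that case.
    """
    prev_body = body_size(prev)
    if prev_body == 0:
        return None
    return body_size(nxt) / prev_body


# Hypothetical candles for illustration.
prev = {"open": 10.0, "high": 12.0, "low": 9.5, "close": 11.0}
nxt = {"open": 11.0, "high": 14.5, "low": 10.8, "close": 14.0}
ratio = body_ratio(prev, nxt)
```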
1
vote
2 answers
Extract Datetime information from a string in a DataFrame column
So I have the Edition column, which contains data in an uneven pattern: some values have ',' followed by the date and some have a ',-' pattern.
df.head()
17 Paperback,– 1 Nov 2016
18 Mass Market Paperback,– 1 Jan 1991
19 …

Kushagra
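Because the separator before the date varies, one option is to ignore it and match the date itself. The following sketch assumes every date has the "day, three-letter month, four-digit year" shape seen in the sample rows; `extract_date` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Day, abbreviated month name, four-digit year, e.g. "1 Nov 2016".
DATE_RE = re.compile(r"(\d{1,2}) ([A-Za-z]{3}) (\d{4})")


def extract_date(edition):
    """Return the date found in an edition string, or None if there is none."""
    match = DATE_RE.search(edition)
    if match is None:
        return None
    try:
        return datetime.strptime(" ".join(match.groups()), "%d %b %Y").date()
    except ValueError:
        # Matched the shape but not a real date (e.g. unknown month name).
        return None


first = extract_date("Paperback,– 1 Nov 2016")
second = extract_date("Mass Market Paperback,– 1 Jan 1991")
```

With a pandas column, the same function could be applied row by row, for example through `Series.apply`.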
1
vote
0 answers
CatBoost Post-Training Feature Information
I would like to understand how I can access information about numerical and categorical features after training a CatBoost model. For the sake of example, here's some toy code:
import pandas as pd
from catboost import CatBoostClassifier,…

Alex R.
1
vote
1 answer
When should Data Binning be used in data processing?
In data pre-processing, data binning is a technique for converting the continuous values of a feature into categorical ones. For example, the values of an age feature in a dataset are sometimes replaced with one of several intervals such…

Javad.Rad
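To make the technique concrete, here is a minimal binning sketch using the standard library. The bin edges and labels are arbitrary choices for illustration, not values from the question.

```python
from bisect import bisect_right

# Assumed bin edges: each edge is the inclusive lower bound of the next bin.
AGE_EDGES = [18, 35, 60]
AGE_LABELS = ["0-17", "18-34", "35-59", "60+"]


def bin_age(age):
    """Map a continuous age to the label of the interval containing it."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


binned = [bin_age(a) for a in [17, 18, 35, 72]]
```

In pandas, `pandas.cut` offers the same mapping for whole columns.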
1
vote
0 answers
Why doesn't xgboost.get_booster().get_score() return a value for one of the variables?
A few questions about feature importance in xgboost in python:
I'm trying to print the feature importance using xgboost.get_booster().get_score(). However, the function sometimes doesn't return anything for some variables. Does that mean the score…

khemedi
1
vote
2 answers
Complicated list-column to column string matching and deriving another column
Dataframes:
df1:
ind_lst
[agriculture_dairy, analytics]
[architecture_planning, advertising_pr_events, analytics]
df2:
ind score
advertising_pr_events 3.672947168
agriculture_dairy 3.368266582
airlines_aviation_aerospace 3.60798955
analytics…
user14281567
1
vote
1 answer
Use FeatureTools to aggregate monthly data from daily
I'm trying to use FeatureTools to create a dataset for use in customer churn analysis. I have a raw dataset of orders that include fields like:
customer_id, order_id, order_month, order_datetime, order_cost
I'd like to create a dataset that returns…

kevin.w.johnson
1
vote
1 answer
How to recreate new columns with column names from one column & column values from the other
I have 2 columns with list values in my data frame as shown below:
salary.labels salary.percentages
['Not Impacted', 'Salary Not Paid', 'Salary Cut', 'Variables Impacted', 'Appraisal Delayed'] [29, 0.9, 2.2, 11.3, 56.6]
['Not Impacted', 'Salary…

sachin kumar s
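One way to read this question is that each label should become a column holding the matching percentage. The sketch below pairs the two lists per row with `zip` and aligns all rows on the union of labels. The first row is the one shown in the excerpt; the second row is hypothetical, because the original is truncated.

```python
rows = [
    {
        "salary.labels": ["Not Impacted", "Salary Not Paid", "Salary Cut",
                          "Variables Impacted", "Appraisal Delayed"],
        "salary.percentages": [29, 0.9, 2.2, 11.3, 56.6],
    },
    {
        # Hypothetical second row with fewer labels.
        "salary.labels": ["Not Impacted", "Salary Cut"],
        "salary.percentages": [70, 30],
    },
]

# Pair each label with its percentage, one dict per original row.
expanded = [dict(zip(r["salary.labels"], r["salary.percentages"])) for r in rows]

# Align every row on the union of labels; missing labels become None.
columns = sorted({label for row in expanded for label in row})
table = [{c: row.get(c) for c in columns} for row in expanded]
```

Passing `expanded` to `pandas.DataFrame` performs a similar alignment, with missing entries filled as NaN.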
1
vote
1 answer
Is it redundant to use df.copy() when writing a function for feature engineering?
I'm wondering if there's any benefit to writing this pattern
def feature_eng(df):
    df1 = df.copy()
    ...
    return df1
as opposed to this pattern
def feature_eng(df):
    ...
    return df

Daniel Tan
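The practical difference between the two patterns shows up when the function assigns new columns: without the copy, the caller's DataFrame is modified too. The sketch below demonstrates this with hypothetical column names (`a`, `b`, `total`) and two illustrative function names.

```python
import pandas as pd


def feature_eng_copy(df):
    """Work on a copy so the caller's DataFrame is left unchanged."""
    out = df.copy()
    out["total"] = out["a"] + out["b"]
    return out


def feature_eng_inplace(df):
    """Assign directly; this adds the column to the caller's DataFrame."""
    df["total"] = df["a"] + df["b"]
    return df


raw = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

safe = feature_eng_copy(raw)
copy_left_raw_untouched = "total" not in raw.columns

mutated = feature_eng_inplace(raw)
inplace_changed_raw = "total" in raw.columns
```

Whether the copy is worth its memory cost depends on whether callers still need the original frame after the call.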