Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
Questions tagged [feature-engineering]
481 questions
1
vote
1 answer
Pipeline & ColumnTransformer: ValueError: Selected columns are not unique in dataframe
Background: I am trying to learn from a notebook used in Kaggle House Price Prediction Dataset.
I am trying to use a Pipeline to transform numerical and categorical columns in a dataframe. It is having issues with my Categorical variables' names,…

Katsu
- 8,479
- 3
- 15
- 16
1
vote
0 answers
What are the features to which fractional differentiation should be applied in ML models?
I am working on supervised machine learning models and I have a couple of doubts about fractional differentiation of features.
In particular, it is not clear which features should be fractionally differentiated.
If I am working with OHLC financial time…

tost
- 23
- 5
1
vote
1 answer
How to extract car model name from the car dataset?
Can anyone help me extract the car model names from the following sample dataframe?
index,Make,Model,Price,Year,Kilometer,Fuel Type,Transmission,Location,Color,Owner,Seller Type
0,Honda,Amaze 1.2 VX…

Gaurav
- 13
- 3
1
vote
0 answers
GLMM-like solution: adding an interaction step as an element of a scikit-learn Pipeline for columns transformed in previous steps
I'm trying to create a solution that is somewhat similar to the Mixed Effects Model (GLMM), which is not present in scikit-learn at the moment. Imagine a simple heart-disease dataset from…

Freejack
- 168
- 10
1
vote
2 answers
Triggered Trapezoid Modelica
I am using a Triggered Trapezoid block within Modelica Logical Blocks.
I am using it on a variable in my model to eliminate the peaks that occur in this variable, because this variable is triggered by a boolean named ON, and when this boolean is…

Dahmani Merzaka
- 143
- 8
1
vote
0 answers
Backward Elimination in Python - how to write a loop to return insignificant variables in Regression
I am working with a data set with 78 variables and I want to do a backward elimination. I can do this easily in R (except when there are categorical variables with more than 53 levels), but I cannot locate a function to do that in Python.
So,…

Zeta10
- 113
- 5
1
vote
1 answer
diff in different periods and variable
I would like to create a function to transform some specific features in a df with the pandas method .diff in the different indicated periods.
I got it working in two steps, but I am sure this can be a one-liner; in other words, it can be simpler.
Given the…

PeCaDe
- 277
- 1
- 8
- 33
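For the `.diff` question above, one possible sketch follows. The excerpt is truncated, so the column names, the periods, and the helper name `add_diffs` are all hypothetical; the only thing taken from the question is the use of the pandas `.diff` method over several periods.

```python
import pandas as pd


def add_diffs(df, columns, periods):
    """Return a copy of df with one new column per (column, period) pair.

    Hypothetical helper: the naming scheme "<col>_diff_<period>" is an
    assumption, not something stated in the question.
    """
    out = df.copy()
    for col in columns:
        for p in periods:
            # Series.diff(p) subtracts the value p rows earlier.
            out[f"{col}_diff_{p}"] = df[col].diff(p)
    return out


# Made-up sample data for illustration only.
df = pd.DataFrame({"a": [1, 3, 6, 10], "b": [2, 4, 8, 16]})
result = add_diffs(df, columns=["a"], periods=[1, 2])
```

Only the listed columns are transformed; the first `p` rows of each new column are NaN because there is no earlier value to subtract.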
1
vote
1 answer
This is a function that reverse-geocodes a coordinate in the California housing dataset to get the specific address. But I have a problem;
please help. This is the function code. I'll post the error I'm getting when I run the function and use pickle.dump
def location(cord):
    latitude=str(cord[0])
    longitude=str(cord[1])
    location=geolocater.reverse("{},…
1
vote
1 answer
How to use CatBoost to encode a dataset?
There is a package based on the CatBoost algorithm, [https://contrib.scikit-learn.org/category_encoders/_modules/category_encoders/cat_boost.html#CatBoostEncoder], that claims to use the CatBoost algorithm to encode datasets. But it has not had all the…

André Godoy
- 11
- 1
1
vote
2 answers
How to assign new column based on the list of string values in pandas
I have a dataframe in which one of the columns contains string values, and I want to assign a new column based on whether this column's values are in the list I specified.
my_list = ['AA', 'TR', 'NZ']
For example:
My dataframe : df
country
AA
TR
SG
The…

Merve
- 43
- 5
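A minimal sketch for the question above, assuming the goal is a flag column that marks whether each `country` value appears in `my_list`. The excerpt is cut off before the desired output, so the new column name `in_list` and the boolean result are assumptions.

```python
import pandas as pd

my_list = ["AA", "TR", "NZ"]

# Sample dataframe taken from the question excerpt.
df = pd.DataFrame({"country": ["AA", "TR", "SG"]})

# Series.isin returns a boolean mask: True where the value is in my_list.
# "in_list" is a hypothetical column name.
df["in_list"] = df["country"].isin(my_list)
```

If a label is wanted instead of a boolean, the mask can be passed to `numpy.where` or `Series.map` to pick the values to assign.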
1
vote
1 answer
How to match geospatial data using GPS coordinates?
I have data collected from different devices A, B, C and all data were recorded in the format of
Table 1 from device A:
Longitude Latitude Feature1 Feature2 Feature3
XX.xxx XX.xxx 10.00 20.00 30.00
---
many rows
Table 2 from device…

Wenyao Leo
- 11
- 4
1
vote
0 answers
Features for KMeans using rows instead of columns for a dataframe of size 100Mx10K
I have huge data coming from multiple files (100M rows, 10K columns). Except the first column, all others are floats, and each column from the input corresponds to a sample that needs to be clustered. Unfortunately, this means I need to transpose…

Quiescent
- 1,088
- 7
- 18
1
vote
0 answers
"Most outlier" feature
I am using the Sklearn implementation of Isolation Forest (IF) to detect outliers on a set of data of 20-30 features.
It is working very well, but I would like insight into which feature has the highest impact when an outlier is detected. Please…

Turrini Marco
- 11
- 3
1
vote
4 answers
creating new columns based on several conditions in R
I have a data frame consisting of three columns and the unique values for status are as follows "X" "0" "C" "1" "2" "3" "4" "5". In the beginning, I do not know how to group by each id and create several columns according to the conditions, for…

tara
- 15
- 3
1
vote
1 answer
How to convert daily data into weekly or monthly in Python with categorical and numerical columns?
I have a daily dataset that has a categorical and a numerical column, and I want to convert it to a monthly dataset. How can I do that using Python? For example, if I have a dataset similar to the picture below how can I bring it in…

Bad Coder
- 177
- 11
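One way to approach the daily-to-monthly question above is sketched here. The picture from the question is not available, so the column names (`date`, `category`, `value`), the sample rows, and the choice of `sum` as the aggregation are all assumptions.

```python
import pandas as pd

# Made-up daily data with one categorical and one numerical column.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-02-01", "2023-02-03"]
    ),
    "category": ["A", "A", "B", "A"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Group by calendar month and category, then aggregate the numeric column.
# Use to_period("W") instead of to_period("M") for weekly buckets.
monthly = (
    df.groupby([df["date"].dt.to_period("M"), "category"])["value"]
    .sum()
    .reset_index()
)
```

Grouping on the categorical column keeps one row per month and category; a different aggregation such as `mean` can be substituted if sums are not meaningful for the data.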