Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work
Questions tagged [feature-engineering]
481 questions
2
votes
1 answer
Pandas Dataframe, TensorFlow Dataset: Where to do the TensorFlow Tokenization step?
I am working on a logistic regression model to predict whether a customer is a business or non-business customer, using Keras in TensorFlow. At the moment I am able to use columns like latitude with the help of tf.feature_columns. Now I am…

Ling
- 449
- 6
- 21
2
votes
1 answer
Dealing with Longitude and Latitude in Feature Engineering
I have a dataset that contains information about houses worldwide with the following features: house size, number of bedrooms, city name, country name, garden or not, ... (and many other typical house attributes). And the target variable is the…

colla
- 717
- 1
- 10
- 22
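One possible way to handle latitude and longitude, sketched here as an assumption rather than as the accepted answer to this question, is to map each pair to a point on the unit sphere so that locations on either side of the ±180° meridian end up close together. The function name `latlon_to_xyz` and the sample coordinates are hypothetical.

```python
import numpy as np


def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to (x, y, z) points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.column_stack([x, y, z])


# Hypothetical sample: two points straddling the antimeridian, plus the North Pole.
features = latlon_to_xyz([0.0, 0.0, 90.0], [179.9, -179.9, 0.0])
```

The three resulting columns can be used in place of the raw latitude and longitude. Whether this helps depends on the model and the target.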
2
votes
1 answer
How to label row values (condition-based) using dplyr in R to create new features
The original data set is similar to this dummy data set. Here I have created a new column, total sales, as the sum of the daily sales, and I have sorted the df in descending order of total sales.
library(dplyr)
empid <- c(10,11,12,13,14,15) # Employee…

rajeswa
- 47
- 9
2
votes
0 answers
Tensorflow Estimator Feature Column increase weight
I have a DNNLinearCombinedClassifier to predict whether an article gets sold or not. I need the DNN for features like description and the Linear part for features like size, category, price, etc. In general it works, but the weight of the price is too low. The price is…

NiBurhe
- 93
- 6
2
votes
1 answer
Is there a way in R to determine which levels within the variables are most important in the GBM predictive model?
I constructed a predictive model using the GBM package in R. I have good results and I am able to see the feature importance list to see which variables are most important to the model. I am struggling with an editor's question asking for direction…

ClareFG
- 65
- 1
- 11
2
votes
2 answers
Fit clustering outputs into Machine Learning model
Just a machine learning/data science problem.
a) Let's say I have a dataset of 20 features, and I decide to use 3 features to perform unsupervised clustering - and ideally this produces 3 clusters (A, B and C).
b) Then I fit that output…

Gabriel
- 438
- 1
- 5
- 16
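A minimal sketch of the setup described in this question, assuming scikit-learn's KMeans as the clustering method (the question does not name one): cluster on 3 of the 20 features, then append the one-hot encoded cluster label as extra columns for a downstream model. The random data and the choice of columns are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))   # hypothetical data: 60 rows, 20 features
cluster_cols = [0, 1, 2]        # hypothetical choice of the 3 clustering features

# Cluster on the selected features only.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X[:, cluster_cols])

# One-hot encode the cluster id so the model does not read an order into A, B, C.
cluster_onehot = np.eye(3)[labels]

# Original 20 features plus 3 cluster indicator columns.
X_augmented = np.hstack([X, cluster_onehot])
```

If the downstream model is evaluated on held-out data, the clustering should be fitted on the training rows only and then applied to the test rows with `kmeans.predict`.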
2
votes
1 answer
Continuous update of Aggregation of last 5 data sets in python
I need to add a new feature that aggregates the last 5 data points. When a 6th data point is added, it should forget the first one and consider only the last 5, as shown below. Here is the dummy data frame; new_feature is the expected output.
id …

Divya
- 23
- 4
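A rolling window is one way to express "aggregate only the last 5 rows". The sketch below assumes the aggregation is a sum and uses a hypothetical `value` column, since the excerpt does not show the real column names or the aggregation function.

```python
import pandas as pd

# Hypothetical data; the original frame is truncated in the excerpt.
df = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7]})

# Sum over the current row and up to 4 preceding rows.
# min_periods=1 yields partial sums for the first rows instead of NaN.
df["new_feature"] = df["value"].rolling(window=5, min_periods=1).sum()
```

From the sixth row onward the oldest value drops out of the window, so each result covers exactly the five most recent rows.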
2
votes
3 answers
OneHotEncoder ValueError: Found unknown categories
I am building the OneHotEncoder using the full file.
def buildOneHotEncoder(training_file_name, categoricalCols):
    one_hot_encoder = OneHotEncoder(sparse=False)
    df = pd.read_csv(training_file_name, skiprows=0, header=0)
    df =…

jeevs
- 261
- 6
- 20
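This error typically appears when `transform` sees a category that was absent when the encoder was fitted. One option, shown here as a sketch and not necessarily the fix chosen in the answers, is scikit-learn's `handle_unknown="ignore"` setting, which encodes unseen categories as an all-zero row. The column name and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data containing an unseen category.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
new = pd.DataFrame({"color": ["green", "purple"]})

# With handle_unknown="ignore", unseen categories do not raise at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Categories are sorted (blue, green, red); "purple" becomes an all-zero row.
encoded = encoder.transform(new).toarray()
```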
2
votes
1 answer
R how to lag 4000 columns 50 times
I have a data frame with 4000 columns and daily observations sorted by time. I want to create new columns that lag all existing columns 50 times in the past. So for a column Y create 50 additional columns that are…

nba2020
- 618
- 1
- 8
- 22
2
votes
1 answer
Is normalization necessary for RandomForest?
1) Is normalization necessary for Random Forests?
2) Should all the features be normalized or only numerical ones?
3) Does it matter whether I normalize before or after splitting into train and test data?
4) Do I need to pre-process features of…

The Hidden Reverse
- 31
- 4
2
votes
2 answers
Pandas reset_index() is not working after grouping by and aggregating by multiple methods
I have a pandas DataFrame with 2 grouping columns and 3 numeric columns.
I am grouping the data like this:
df = df.groupby(['date_week', 'uniqeid']).agg({
    'completes':['sum', 'median', 'var', 'min', 'max']
    ,'dcount_visitors': ['sum',…

R_Queery
- 497
- 1
- 9
- 19
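Passing a list of functions per column to `agg` produces a two-level column index, which is a common source of confusion when `reset_index()` is called afterwards. Whether that is the cause here is an assumption, since the excerpt is truncated. The sketch below reuses the column names from the question with hypothetical values and flattens the column levels before resetting the index.

```python
import pandas as pd

# Hypothetical values; column names are taken from the question.
df = pd.DataFrame({
    "date_week": ["w1", "w1", "w2"],
    "uniqeid": ["a", "a", "b"],
    "completes": [1, 3, 5],
})

grouped = df.groupby(["date_week", "uniqeid"]).agg({"completes": ["sum", "max"]})

# Columns are now tuples such as ("completes", "sum"); join them into flat names.
grouped.columns = ["_".join(col) for col in grouped.columns]

# Move the grouping keys back from the index into ordinary columns.
grouped = grouped.reset_index()
```

Note that `reset_index()` returns a new frame by default, so the result has to be assigned back as shown.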
2
votes
1 answer
Suggestions for feature engineering
I am having a problem during feature engineering and am looking for suggestions. Problem statement: I have usage data of multiple customers for 3 days. Some have just 1 day of usage, some 2, and some 3. The data is related to the number of emails sent /…

SSuram
- 61
- 4
2
votes
2 answers
Why does the TensorFlow error `failed to convert object of type to Tensor` happen, and how can I solve it?
I am doing a task on traffic analysis and I am stymied by an error in my code. My data rows are like this:
qurter | DOW (Day of week)| Hour | density | speed | label (predicted speed for another half an hour)
The values are like this:
1, 6, 19,…

Masoud Masoumi Moghadam
- 1,094
- 3
- 23
- 45
2
votes
0 answers
Writing a dask bag of data frame to disk (Generating 2 million features with dask and featuretools)
I'm very new to both Dask and Featuretools, so I'm having a lot of difficulty combining them to parallelize feature engineering.
Short version: solving an immediate problem
I have a dask bag dfs of pandas DataFrame and want to output them as csv with…

An Hoang
- 21
- 3
2
votes
1 answer
FeatureTools: Dealing with many-to-many relationships
I have a dataframe of purchases with multiple columns, including the three below:
PURCHASE_ID (index of purchase)
WORKER_ID (index of worker)
ACCOUNT_ID (index of account)
A worker can have multiple accounts associated with them, and an account…

LEJ
- 1,868
- 4
- 16
- 24