Questions tagged [feature-engineering]

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

481 questions
2
votes
1 answer

During calculation of "distance average" in knn imputation method for replacing NaN value in particular column

I encountered this problem while implementing the KNN imputation method for handling missing data from scratch. I created a dummy dataset and found the nearest neighbors for rows that contain missing values. Here is my dataset: A B C D …
2
votes
0 answers

Putting weights on values of a categorical feature

Suppose we have the following dataset: df = pd.DataFrame({'feature 1':['a','b','c','d','e'], 'feature 2':[1,2,3,4,5],'y':[1,0,0,1,1]}). As we can see, feature 1 is categorical. In the usual tree-based models such as XGBoost or CatBoost, the values under…
2
votes
3 answers

In a pandas column, how to find the max number of consecutive rows that a particular value occurs?

Let's say we have the following df with the column names. df = pd.DataFrame({ 'names':['Alan', 'Alan', 'John', 'John', 'Alan', 'Alan','Alan', np.nan, np.nan, np.nan, np.nan, np.nan, 'Christy', 'Christy','John']}) >>> df names 0 …
elixir
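One possible way to approach the consecutive-run question above, as a minimal sketch rather than the accepted answer: label each unbroken run with a cumulative counter, then take the longest run that matches the target value. The helper name `max_consecutive` is hypothetical, and the sample frame reuses the `names` column from the excerpt.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'names': ['Alan', 'Alan', 'John', 'John', 'Alan', 'Alan', 'Alan',
              np.nan, np.nan, np.nan, np.nan, np.nan,
              'Christy', 'Christy', 'John']})

def max_consecutive(series, value):
    """Length of the longest unbroken run of `value` in `series` (hypothetical helper)."""
    # NaN never equals itself, so it needs isna() instead of eq()
    hit = series.isna() if pd.isna(value) else series.eq(value)
    # A new run starts wherever the match flag differs from the previous row
    run_id = hit.ne(hit.shift()).cumsum()
    # Summing the boolean flag per run gives its length (0 for non-matching runs)
    run_lengths = hit.groupby(run_id).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0

print(max_consecutive(df['names'], 'Alan'))   # 3
print(max_consecutive(df['names'], np.nan))   # 5
```

Treating NaN as a countable value is an assumption here; drop the `isna()` branch if missing values should be ignored.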
2
votes
0 answers

Embed row of data from dataframe into single vector or array values

Is there any way or process to embed each row of data into a vector or array of shape (1,)? My intention is to turn each row's information into a representative input feature, so that I…
Yeo Keat
2
votes
1 answer

How to create binary variable for each individual based on value in other variable?

So I have a data set consisting of 4 individuals. Each individual is measured over a different time period. In R: df = data.frame(cbind("id"=c(1,1,1,2,2,3,3,3,3,4,4), "t"=c(1,2,3,1,2,1,2,3,4,1,2), "x1"=c(0,1,0,1,0,0,1,0,1,0,0))) and I want to create…
pikachu
2
votes
2 answers

deep feature synthesis depth for transformation primitives | featuretools

I am trying to use the featuretools library to make new features on a simple dataset, however, whenever I try to use a bigger max_depth, nothing happens... Here is my code so far: # imports import featuretools as ft # creating the EntitySet es =…
2
votes
2 answers

Variable creation - Inferring age

I have a grouped dataframe; Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C') OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil') Odometer <- c(1000, 1000,…
Brad
2
votes
2 answers

Convert a column of list of dictionaries to a column list such that the values are derived from the key "name" under each dictionary in the list

The input column holds lists with a variable number of dictionaries; the count is not fixed. INPUT column: Facilities [{'name': 'Work from home', 'icon': 'WFH.svg'}] [{'name': 'Gymnasium', 'icon': 'Gym.svg'}, {'name': 'Cafeteria', 'icon': 'Cafeteria.svg'},…
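A minimal sketch of the conversion described above, assuming each cell already holds a real Python list of dictionaries (not a string that still needs parsing). The sample rows use only the entries visible in the excerpt, and the output column name `facility_names` is hypothetical.

```python
import pandas as pd

# Sample rows built from the entries visible in the excerpt only
df = pd.DataFrame({'Facilities': [
    [{'name': 'Work from home', 'icon': 'WFH.svg'}],
    [{'name': 'Gymnasium', 'icon': 'Gym.svg'},
     {'name': 'Cafeteria', 'icon': 'Cafeteria.svg'}],
]})

# For every cell, keep just the value stored under the "name" key
df['facility_names'] = df['Facilities'].apply(
    lambda items: [d['name'] for d in items])

print(df['facility_names'].tolist())
# [['Work from home'], ['Gymnasium', 'Cafeteria']]
```

If the cells are stored as strings, they would first need to be parsed (for example with `ast.literal_eval`) before this step applies.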
2
votes
1 answer

Pandas: count identical values in columns but from different index

I have a data frame representing customer ratings of restaurants. rating_year is the year the rating was made, first_year is the year the restaurant opened and last_year is the last business year of a restaurant. What I want to do is calculate…
Lynn
2
votes
1 answer

Pandas for binary classification

I have been using pandas for data processing before training a binary classifier. One of the things I could not find was a function that tells me, given a value of a certain feature, let's say Age (for example, people who are 60 years old), which percentage…
erni
2
votes
0 answers

Box-Cox transformation with tree-based models (XGBoost to be specific)

I have a question regarding the Box-Cox transformation (or log transformation). I am working on a dataset in which I have lots of skewed features. Now when I apply the Box-Cox transformation, I get quite a nice distribution, but the thing is correlation…
CheeseBurger
2
votes
1 answer

Using the column operator to check if pass or fail

I'm not sure how I can use the operators column to return a pandas Series that determines whether a certain row's activity passes or fails based on its passing score, operator and actual. Dataset Sample: data={"ID": [1,1,2,2], …
Maku
2
votes
1 answer

Create one new column in pandas dataframe comprised of previous year stats for each player in the dataframe

(python) I currently have a pandas dataframe that looks something like this: player | year | points | ----------------------------------------------- LeSean McCoy | 2012 | 199.3 …
ekselan
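A sketch of one way to build the previous-year column described above, assuming one row per player and year. It uses a self-merge on a copy whose year is shifted forward by one, so a missing season yields NaN instead of silently borrowing an older year. The sample data and the column name `prev_year_points` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: one row per (player, year)
df = pd.DataFrame({
    'player': ['Player A', 'Player A', 'Player B', 'Player B', 'Player A'],
    'year':   [2012, 2013, 2012, 2014, 2014],
    'points': [199.3, 250.0, 120.5, 98.0, 180.2],
})

# Shift each season forward one year so it lines up with the following season
prev = df[['player', 'year', 'points']].copy()
prev['year'] += 1
prev = prev.rename(columns={'points': 'prev_year_points'})

# Left merge keeps every original row; seasons with no prior year get NaN
out = df.merge(prev, on=['player', 'year'], how='left')
print(out)
```

A `groupby('player')['points'].shift()` on a frame sorted by year is shorter, but it returns the previous available row, which is not the previous calendar year when a player skipped a season (Player B in 2014 here).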
2
votes
1 answer

Python featuretools difference by data group

I'm trying to use featuretools to calculate time-series functions. Specifically, I'd like to subtract current(x) from previous(x) by a group-key (user_id), but I'm having trouble adding this kind of relationship to the entityset. df =…
2
votes
2 answers

Pandas - most recent match relative to current row

I would like to add a new column to my dataframe that contains the most recent 'revenue' value where 'promotion' == 1, excluding the current row. The dataframe will always be sorted by 'day' in descending order. For rows near the bottom of the…
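A minimal sketch of the lookup described above, under two assumptions: the frame is sorted by 'day' in descending order (so earlier days sit further down), and rows with no earlier promotion should simply stay NaN. The sample values and the column name `last_promo_revenue` are hypothetical.

```python
import pandas as pd

# Hypothetical sample, sorted by 'day' in descending order
df = pd.DataFrame({
    'day':       [5, 4, 3, 2, 1],
    'promotion': [1, 0, 1, 1, 0],
    'revenue':   [50, 40, 30, 20, 10],
})

# Keep revenue only on promotion rows; everything else becomes NaN
promo_revenue = df['revenue'].where(df['promotion'].eq(1))

# shift(-1) excludes the current row, bfill pulls up the nearest earlier promotion
df['last_promo_revenue'] = promo_revenue.shift(-1).bfill()

print(df['last_promo_revenue'].tolist())
# [30.0, 30.0, 20.0, nan, nan]
```

If the frame were sorted ascending instead, the same idea would use `shift(1).ffill()`.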