I'm stuck. Using Featuretools, all I want to do is create a new column that sums two columns from my dataset, creating a "stacked" feature of sorts, and to do this for every pair of columns in my dataset.
My code looks like this:
import warnings

import featuretools as ft
import pandas as pd


# Define the function
def feature_engineering_dataset(df):
    es = ft.EntitySet(id='stockdata')

    # Make the "Date" index an actual column, because defining it as the index
    # below throws a "can't find Date in index" error for some reason.
    df = df.reset_index()

    # Save some columns not used in Featuretools to concat back later
    dates = df['Date']
    tickers = df['Ticker']
    dailychange = df['DailyChange']
    classes = df['class']
    dataframe = df.drop(['Date', 'Ticker', 'DailyChange', 'class'], axis=1)

    # Define the entity. It won't find Date, so it uses a numbered index.
    # We'll re-define Date as the index later.
    es.entity_from_dataframe(entity_id='data', dataframe=dataframe, index='Date')

    # Pesky warnings
    warnings.filterwarnings("ignore", category=RuntimeWarning)
    warnings.filterwarnings("once", category=ImportWarning)

    # Run deep feature synthesis
    feature_matrix, feature_defs = ft.dfs(
        n_jobs=-2,
        entityset=es,
        target_entity='data',
        chunk_size=0.015,
        max_depth=2,
        verbose=True,
        agg_primitives=['sum'],
        trans_primitives=[],
    )

    # Now re-add the previously saved columns because featuretools...
    df = pd.concat([dates, tickers, feature_matrix, dailychange, classes], axis=1)
    df = df.set_index(['Date'])

    # Return our new dataset!
    return df


# Now run that defined function
df = feature_engineering_dataset(df)
I'm not sure what's really happening here. I've set max_depth=2, so my understanding is that for every pair of columns in my dataset, it will create a new column that sums the two together. Is that right?
My initial dataframe has 3101 columns. When I run this, it says "Built 3098 features", and the final df has 3098 columns after the concat, which isn't right: it should have all my original features PLUS the engineered ones.
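To make the expected result concrete, here is a minimal sketch of the output I'm after, written in plain pandas rather than Featuretools. The helper name add_pairwise_sums and the toy columns a, b, c are made up for illustration; they are not part of my real dataset or of any library.

```python
from itertools import combinations

import pandas as pd


def add_pairwise_sums(df: pd.DataFrame) -> pd.DataFrame:
    """Return df plus one new column per unordered pair of numeric columns.

    Hypothetical helper, only meant to illustrate the desired output.
    """
    numeric_cols = df.select_dtypes(include="number").columns
    sums = {
        f"{a} + {b}": df[a] + df[b]
        for a, b in combinations(numeric_cols, 2)
    }
    # Build all new columns in one go, then append them to the originals.
    return pd.concat([df, pd.DataFrame(sums, index=df.index)], axis=1)


# Toy data standing in for the real feature columns
toy = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})
result = add_pairwise_sums(toy)
print(result.columns.tolist())
# ['a', 'b', 'c', 'a + b', 'a + c', 'b + c']
```

So 3 original columns become 3 + 3 = 6. I realize the number of pairs grows as n(n-1)/2, which for roughly 3,100 columns is about 4.8 million new columns, so the full version may not be practical as-is.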
How can I achieve what I'm after? The examples on the Featuretools page and in the API docs are extremely confusing and deal mostly with date-based examples, like the "time_since_last" trans primitive, and other things that don't seem to apply here. Thanks!