
I'm stuck. Using Featuretools, all I want to do is create a new column that sums two columns from my dataset, a "stacked" feature of sorts, and to do this for every pair of columns in the dataset.

My code looks like this:

import warnings

import featuretools as ft
import pandas as pd

# Define the function
def feature_engineering_dataset(df):

    es = ft.EntitySet(id = 'stockdata')
    
    # Make the "Date" index an actual column cuz defining it as the index below throws
    # a "can't find Date in index" error for some reason.
    df = df.reset_index()

    # Save some columns not used in Featuretools to concat back later
    dates = df['Date']
    tickers = df['Ticker']
    dailychange = df['DailyChange']
    classes = df['class']

    dataframe = df.drop(['Date', 'Ticker', 'DailyChange', 'class'],axis=1)

    # Define the entity
    es.entity_from_dataframe(entity_id='data', dataframe=dataframe, index='Date') # Won't find Date so uses a numbered index. We'll re-define date as index later

    # Pesky warnings
    warnings.filterwarnings("ignore", category=RuntimeWarning) 
    warnings.filterwarnings("once", category=ImportWarning)

    # Run deep feature synthesis
    feature_matrix, feature_defs = ft.dfs(n_jobs=-2,entityset=es, target_entity='data', 
                                           chunk_size=0.015,max_depth=2,verbose=True,
                    agg_primitives = ['sum'],
                    trans_primitives = []
                    ) 

    # Now re-add the previously saved columns because featuretools...
    df = pd.concat([dates, tickers, feature_matrix, dailychange, classes], axis=1)
    
    df = df.set_index(['Date'])
    
    # Return our new dataset!
    return(df)

# Now run that defined function
df = feature_engineering_dataset(df)

I'm not sure what's really happening here, but I've defined a depth of 2, so it's my understanding that for every combination of pairs of columns in my dataset, it'll create a new column that sums the two together?

My initial dataframe has 3101 columns. When I run this command it says "Built 3098 features", and the final df has 3098 columns after the concat, which isn't right: it should have all my original features PLUS the engineered ones.

How can I achieve what I'm after? The examples on the featuretools page and API docs are extremely confusing and deal a lot with dated examples, like "time_since_last" trans primitives and other stuff that doesn't seem to apply here. Thanks!

wildcat89

1 Answer


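Thanks for the question. You can create new columns that sum pairs of columns by using the transform primitive add_numeric. Aggregation primitives such as sum are applied across relationships between entities, so with a single entity they produce no new features, which is why DFS returned only your original 3098 columns. I'll go through a quick example using this data.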

id                time      open      high       low     close
 0 2019-07-10 07:00:00  1.053362  1.053587  1.053147  1.053442
 1 2019-07-10 08:00:00  1.053457  1.054057  1.053457  1.053987
 2 2019-07-10 09:00:00  1.053977  1.054192  1.053697  1.053917
 3 2019-07-10 10:00:00  1.053902  1.053907  1.053522  1.053557
 4 2019-07-10 11:00:00  1.053567  1.053627  1.053327  1.053397

First, we create the entity set for the data.

import featuretools as ft

es = ft.EntitySet('stockdata')

es.entity_from_dataframe(
    entity_id='data',
    dataframe=df,
    index='id',
    time_index='time',
)

Now, we apply DFS using the transform primitive to add the numeric columns.

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity='data',
    trans_primitives=['add_numeric'],
)

Then, the new engineered features are returned along with the original ones.

feature_matrix
        open      high       low     close  close + high  low + open  high + low  close + open  high + open  close + low
id
0   1.053362  1.053587  1.053147  1.053442      2.107029    2.106509    2.106734      2.106804     2.106949     2.106589
1   1.053457  1.054057  1.053457  1.053987      2.108044    2.106914    2.107514      2.107444     2.107514     2.107444
2   1.053977  1.054192  1.053697  1.053917      2.108109    2.107674    2.107889      2.107894     2.108169     2.107614
3   1.053902  1.053907  1.053522  1.053557      2.107464    2.107424    2.107429      2.107459     2.107809     2.107079
4   1.053567  1.053627  1.053327  1.053397      2.107024    2.106894    2.106954      2.106964     2.107194     2.106724
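To make it concrete what these new columns contain, here is a minimal sketch in plain Python, without Featuretools, that reproduces the same pairwise sums on the sample data. It is only an illustration of the result, not how Featuretools computes it; the helper name add_pairwise_sums is hypothetical, and the "a + b" naming with alphabetically ordered column names is an assumption based on the output shown above.

```python
from itertools import combinations

# Sample data from the example above, one list per column.
columns = {
    "open":  [1.053362, 1.053457, 1.053977, 1.053902, 1.053567],
    "high":  [1.053587, 1.054057, 1.054192, 1.053907, 1.053627],
    "low":   [1.053147, 1.053457, 1.053697, 1.053522, 1.053327],
    "close": [1.053442, 1.053987, 1.053917, 1.053557, 1.053397],
}


def add_pairwise_sums(columns):
    """Return the original columns plus one summed column per unordered pair.

    Hypothetical helper for illustration; it is not part of Featuretools.
    """
    result = {name: list(values) for name, values in columns.items()}
    # Sorting the names gives labels such as "close + high", matching the
    # naming seen in the feature matrix above.
    for a, b in combinations(sorted(columns), 2):
        result[f"{a} + {b}"] = [x + y for x, y in zip(columns[a], columns[b])]
    return result


features = add_pairwise_sums(columns)
print(len(features))  # 4 original columns + 6 pairwise sums
print(round(features["high + open"][0], 6))
```

Each unordered pair appears once, so "high + open" is created but "open + high" is not.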

You can see a list of all the built-in primitives by calling the function ft.list_primitives().
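One thing to keep in mind for a wide dataset like the one in the question: summing every pair of columns grows quadratically with the number of columns. The sketch below just counts the pairs. It assumes all 3098 features reported in the question are numeric and that each unordered pair yields exactly one new feature, as in the four-column example above. The function name pairwise_feature_count is hypothetical, and math.comb needs Python 3.8 or later.

```python
import math


def pairwise_feature_count(n_columns: int) -> int:
    """Number of unordered column pairs, i.e. n * (n - 1) / 2."""
    return math.comb(n_columns, 2)


n_original = 3098  # feature count reported in the question
n_pairs = pairwise_feature_count(n_original)

print(pairwise_feature_count(4))  # the four-column example above
print(n_pairs)                    # pairwise sums for the question's data
print(n_original + n_pairs)       # total columns in the feature matrix
```

If that count is too large to be practical, it may be worth building the entity set from only the columns you actually want to combine.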

Jeff Hernandez
Thank you! I knew I was close. I'll add the ft.list_primitives() to my workflow so I can familiarize myself with the transformations. – wildcat89 Jul 16 '20 at 02:20