
I tried to pass a cutoff_time dataframe to the dfs function of featuretools so that each row in my dataframe has its own cutoff time, but I cannot make the cutoff_time dataframe work as expected.

The documentation says that the first column of cutoff_time should be 'instance_id', but I'm not sure what that means. I tried both the index of the target entity (transaction) and the customer id column (id) of the target entity (transaction). Both of them mess up the feature_matrix.

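import pandas as pd
import featuretools as ft

# transaction_fake must exist before columns are assigned to it
transaction_fake = pd.DataFrame()
merkle_fake = pd.DataFrame()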

transaction_fake['order_date'] = ['2019-01-01','2018-01-01','2017-01-01','2018-05-23','2019-02-02','2018-12-21']
transaction_fake['product_category'] = ['cat2','cat3','cat2','cat1','cat2','cat2']
transaction_fake['id'] = ['1','2','1','3','1','2']
transaction_fake['order_date'] = pd.to_datetime(transaction_fake['order_date'])
transaction_fake['index'] = transaction_fake.index.tolist()

merkle_fake['cust_id'] = ['1','2','3']

es_demo = ft.EntitySet(id = 'demo')
es_demo.entity_from_dataframe(entity_id='transaction', dataframe= transaction_fake,time_index = 'order_date',index = 'index')
es_demo.entity_from_dataframe(entity_id='merkle', dataframe= merkle_fake,index = 'cust_id')
relationship_fake = ft.Relationship(es_demo["merkle"]["cust_id"],es_demo["transaction"]["id"])
es_demo = es_demo.add_relationship(relationship_fake)

cutoff_times_demo = pd.DataFrame()
cutoff_times_demo['instance_id'] = es_demo['transaction'].df['id']
cutoff_times_demo['time'] = es_demo['transaction'].df['order_date']

feature_matrix_demo, feature_defs_demo = ft.dfs(entityset=es_demo,
                                               agg_primitives=['count'],
                                               trans_primitives=[],
                                               target_entity='transaction',
                                               cutoff_time= cutoff_times_demo,
                                               features_only = False)

feature_matrix_demo 

I expect the output to look like this:

    product_category  id  merkle.COUNT(transaction)
2   cat2              1   1
1   cat3              2   1
3   cat1              3   1
5   cat2              2   2
0   cat2              1   2
4   cat2              1   3

But it gives me:

       product_category  id   merkle.COUNT(transaction)
index
1      NaN               NaN  0
2      cat2              1    1
3      cat1              3    1
2      cat2              1    1
1      cat3              2    2
1      cat3              2    2
Adam Li

1 Answer


When you pass in a DataFrame with ‘instance_id’ and ‘time’ columns for cutoff_time, dfs calculates the features for each instance, identified by its ‘instance_id’, using only data up to and including the corresponding ‘time’. The ‘instance_id’ identifies a row in the target entity, so its values must come from the target entity's index column.

Hence, when you pass in es_demo['transaction'].df['id'] for cutoff_times_demo['instance_id'], the customer ids are interpreted as transaction index values: you are telling dfs() to calculate row 1 at ‘2017-01-01’, row 2 at ‘2018-01-01’, row 3 at ‘2018-05-23’, row 2 at ‘2018-12-21’, etc. This produces NaN values for the first row in the returned feature_matrix because row 1 has an order_date of 2018-01-01, so no data exists for it at or before the 2017-01-01 cutoff.

To get the output you expected, set the instance_id column to the transaction index instead of the customer id:

cutoff_times_demo['instance_id'] = es_demo['transaction'].df['index']
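
To see why the two choices of instance_id differ, here is a minimal pure-Python sketch of the cutoff rule described above. It does not call featuretools; count_at_cutoff is a hypothetical helper written only for illustration, and it assumes (as the row sequence above implies) that the entity's rows are ordered by order_date when the id column is read back. It returns None where dfs reports NaN/0 for a row that does not yet exist at the cutoff.

```python
from datetime import date

# The question's transactions: index -> (order_date, customer id)
transactions = {
    0: (date(2019, 1, 1), "1"),
    1: (date(2018, 1, 1), "2"),
    2: (date(2017, 1, 1), "1"),
    3: (date(2018, 5, 23), "3"),
    4: (date(2019, 2, 2), "1"),
    5: (date(2018, 12, 21), "2"),
}


def count_at_cutoff(instance_id, cutoff):
    """Hypothetical stand-in for merkle.COUNT(transaction) at a cutoff time.

    instance_id is looked up in the transaction index, not in the customer ids.
    """
    order_date, customer = transactions[instance_id]
    if order_date > cutoff:
        # The target row itself lies after the cutoff, so there is no data for it.
        return None
    # Count this customer's transactions up to and including the cutoff.
    return sum(1 for d, c in transactions.values() if c == customer and d <= cutoff)


# Correct pairing: each transaction index with its own order_date.
correct_pairs = [(idx, d) for idx, (d, _) in transactions.items()]

# Mistaken pairing: customer ids (in order_date order) used as instance ids.
mistaken_pairs = [(int(c), d) for d, c in sorted(transactions.values())]

correct = [(i, count_at_cutoff(i, t)) for i, t in correct_pairs]
mistaken = [(i, count_at_cutoff(i, t)) for i, t in mistaken_pairs]

print(correct)
print(mistaken)
```

With the index as instance_id, the counts match the expected table in the question; with the customer id, they reproduce the unexpected output, including the empty first row.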

Angela Lin