Suppose I have two datasets (corresponding to two entities in my entityset):

First one: customers (cust_id, name, birthdate, customer_since)
Second one: bookings (booking_id, service, chargeamount, booking_date)

Now I want to create a dataset with features built from all customers (regardless of how long they have been customers) but only from bookings made in the last two years.

How should I use the "last_time_index"? Can I set a "last_time_index" for only one entity? In this case it would be only the bookings entity, because I want ALL customers but not all bookings.

I used this code to create the features:

feature_matrix, features = ft.dfs(entityset=es,
                              target_entity="customers",
                              cutoff_time= pd.to_datetime('30/05/2018'),
                              training_window = ft.Timedelta(2*365,"d"),
                              agg_primitives=["count"],
                              trans_primitives=["time_since","year"],
                              cutoff_time_in_index = True)
Flo

1 Answer

The time_index of an entity specifies the first time an instance is valid for use. In that way, the choices you make in setting a time index can impact your final result. Depending on how you set up your time_index, it is possible to use ft.dfs with exactly the settings in your example to get the desired output. Here is a toy example similar to the data you've described:

import pandas as pd
import featuretools as ft

bookings_df = pd.DataFrame()
bookings_df['booking_id'] = [1, 2, 3, 4]
bookings_df['cust_id'] = [1, 1, 2, 5]
bookings_df['booking_date'] = pd.date_range('1/1/2014', periods=4, freq='Y')

customer_df = pd.DataFrame()
customer_df['cust_id'] = [1, 2, 5]
customer_df['customer_since']  = pd.to_datetime(['2014-01-01', '2016-01-01', '2017-01-01'])

es = ft.EntitySet('Bookings')
es.entity_from_dataframe('bookings', bookings_df, 'booking_id', time_index='booking_date')
es.entity_from_dataframe('customers', customer_df, 'cust_id')

es.add_relationship(ft.Relationship(es['customers']['cust_id'], es['bookings']['cust_id']))

We have set up our bookings_df with one event a year for the past four years. The dataframe looks like this:

    booking_id  cust_id  booking_date
0    1           1        2014-12-31
1    2           1        2015-12-31
2    3           2        2016-12-31
3    4           5        2017-12-31

Notice that we have not set the time index for customers, meaning that all customer data is always valid for use. Running DFS without the training_window argument will return:

         YEAR(customer_since)   COUNT(bookings)
cust_id     
1         2014                   2.0
2         2016                   1.0
5         2017                   1.0

After adding the training_window of two years (as in your example), only two of the four bookings are used:

         YEAR(customer_since)   COUNT(bookings)
cust_id     
1         2014                   0.0
2         2016                   1.0
5         2017                   1.0
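
To make the arithmetic behind those two tables explicit, here is a minimal plain-Python sketch that reproduces the counts by hand. It does not call featuretools; the helper name count_bookings is hypothetical, and the boundary handling (window start excluded, cutoff included) is an assumption for illustration that may differ from what the library does at the exact edges.

```python
from datetime import datetime, timedelta

# Toy data copied from the example above: (booking_id, cust_id, booking_date)
bookings = [
    (1, 1, datetime(2014, 12, 31)),
    (2, 1, datetime(2015, 12, 31)),
    (3, 2, datetime(2016, 12, 31)),
    (4, 5, datetime(2017, 12, 31)),
]
customers = [1, 2, 5]

cutoff = datetime(2018, 5, 30)        # same cutoff as in the question
window = timedelta(days=2 * 365)      # same two-year training window


def count_bookings(bookings, customers, cutoff, window=None):
    """Hypothetical helper: count bookings per customer up to the cutoff.

    If a window is given, only bookings later than (cutoff - window) are
    counted. The edge handling here is an assumption, not library behavior.
    """
    start = cutoff - window if window is not None else None
    # Every customer gets a row, even with zero bookings in the window.
    counts = {cust_id: 0 for cust_id in customers}
    for _, cust_id, booking_date in bookings:
        if booking_date > cutoff:
            continue  # not yet valid at the cutoff time
        if start is not None and booking_date <= start:
            continue  # too old for the training window
        counts[cust_id] += 1
    return counts


all_counts = count_bookings(bookings, customers, cutoff)
windowed_counts = count_bookings(bookings, customers, cutoff, window)

print(all_counts)
print(windowed_counts)
```

The window starts 730 days before the cutoff, so the two bookings of customer 1 (2014 and 2015) fall outside it while the customer row itself is kept, which is the behavior the question asks for.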
Seth Rothschild