
I'm wondering if there is a way to automatically select the amount of past data when calculating features.

For example, I might want to predict when a customer is going to make their next purchase, so it would be good to know a count of purchases or the average purchase price over different lookback periods, e.g. purchases in the last 12 months, the last 3 months, or the last 7 days.

What is the best way to approach this with featuretools?

Max Kanter
tsp

1 Answer


You can create a feature matrix that uses only a certain amount of historical data by passing the training_window parameter to featuretools.dfs. When training_window is set, Featuretools uses only the historical data between cutoff_time - training_window and the cutoff time. Here's the example from the documentation:

window_fm, window_features = ft.dfs(entityset=es,
                                    target_entity="customers",
                                    cutoff_time=cutoff_times,
                                    cutoff_time_in_index=True,
                                    training_window="1 hour")

To decide which rows are valid for use, Featuretools checks whether the value in the time_index column falls within the training window.
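To make the window check concrete, here is a minimal plain-Python sketch of the idea. It is not Featuretools code: `rows_in_window` and the sample purchases are made up for illustration, and the boundary handling (start excluded, cutoff included) is an assumption that may differ from what your Featuretools version does.

```python
from datetime import datetime, timedelta


def rows_in_window(rows, cutoff, window):
    """Keep rows whose timestamp lies between cutoff - window and cutoff.

    Hypothetical helper. Boundary handling (start exclusive, cutoff
    inclusive) is an assumption, not a statement about Featuretools.
    """
    start = cutoff - window
    return [r for r in rows if start < r["time"] <= cutoff]


# Made-up purchase log; "time" plays the role of the time_index column.
purchases = [
    {"customer_id": 1, "time": datetime(2018, 6, 26, 9, 30), "amount": 20.0},
    {"customer_id": 1, "time": datetime(2018, 6, 26, 10, 45), "amount": 35.0},
    {"customer_id": 1, "time": datetime(2018, 6, 26, 11, 15), "amount": 12.0},
]

cutoff = datetime(2018, 6, 26, 11, 0)
recent = rows_in_window(purchases, cutoff, timedelta(hours=1))
print(len(recent))  # 1: only the 10:45 purchase is inside the one-hour window
```

The 9:30 purchase is older than the window, and the 11:15 purchase happens after the cutoff, so neither would contribute to features computed at that cutoff.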

Max Kanter
  • So I'm guessing if I wanted to use multiple training windows (1 week, 1 month, 6 months, 1 year etc.) I would need to run dfs multiple times with the different training windows? And if I wanted those to be discrete (>= 1 week and < 1 month, >= 1 month and <6 months), I'd need to alter the cutoff times as well? Thanks Max! – tsp Jun 26 '18 at 22:34
  • Yep, that's how to do it. If it helps, you can include a given customer id twice in your cutoff times dataframe if the cutoff times themselves aren't the same. It would return two feature vectors for the customer, but calculated at the two specified times. – Max Kanter Jun 27 '18 at 22:50
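The approach discussed in the comments could be sketched roughly as follows. To get discrete buckets such as ">= 1 week and < 1 month", shift the cutoff time back by the newer edge and use the bucket width as the training window, then run dfs once per bucket. Both helpers are hypothetical, not part of Featuretools. The `customer_id` column name is an assumption about the index of the customers entity, and the `target_entity` argument follows the example in the answer (newer Featuretools releases name it `target_dataframe_name`). The second function has not been run against a particular Featuretools version.

```python
from datetime import datetime, timedelta


def discrete_windows(cutoff, edges):
    """Split the history before `cutoff` into consecutive, non-overlapping buckets.

    `edges` are ages measured back from `cutoff`, in increasing order.
    Returns one (label, shifted_cutoff, training_window) triple per bucket.
    """
    buckets = []
    for newer, older in zip(edges, edges[1:]):
        label = f"{newer.days}d_to_{older.days}d"
        # Shift the cutoff back to the newer edge; the window spans the bucket.
        buckets.append((label, cutoff - newer, older - newer))
    return buckets


def run_dfs_per_bucket(es, customer_ids, buckets):
    """Run ft.dfs once per bucket and return {label: feature_matrix}.

    Hypothetical helper; assumes an entityset `es` with a "customers" entity.
    """
    import pandas as pd
    import featuretools as ft

    matrices = {}
    for label, shifted_cutoff, window in buckets:
        # First column: instance id (assumed name); second column: cutoff time.
        cutoff_times = pd.DataFrame({"customer_id": customer_ids,
                                     "time": shifted_cutoff})
        fm, _ = ft.dfs(entityset=es,
                       target_entity="customers",
                       cutoff_time=cutoff_times,
                       training_window=pd.Timedelta(window))
        # Prefix the columns so the buckets can be told apart after joining.
        matrices[label] = fm.add_prefix(label + "__")
    return matrices


buckets = discrete_windows(
    datetime(2018, 6, 26),
    [timedelta(0), timedelta(days=7), timedelta(days=30)],
)
for label, shifted_cutoff, window in buckets:
    print(label, shifted_cutoff.date(), window.days)
```

For overlapping windows (last week, last month, last year), the cutoff stays fixed and only `training_window` changes between runs. The resulting matrices share the customer index, so they can be joined column-wise afterwards.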