Using cutoff_times in featuretools for prediction

Question

I am building a model to predict whether a user will purchase a subscription based on their reading history and other activity. I am using featuretools (https://www.featuretools.com/) to automate feature engineering, and this is where it gets tricky:

How should I decide the cutoff times and training window for my training data? Specifically:

How long should the training window be: 1 month, 6 months, etc.?
Given that user activity may differ before and after subscribing, I should cut off data for current subscribers at the time they subscribed (to prevent leakage). But when should I cut off data for non-subscribers?

import featuretools as ft

# `es` (the EntitySet) and `cutoff_times` (one cutoff time per user) are built beforehand.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                     target_entity="users",
                                     max_depth=2,
                                     agg_primitives=["sum", "std", "max", "min", "mean", "median", "count",
                                                     "percent_true", "num_unique", "mode", "avg_time_between"],
                                     trans_primitives=["day", "year", "month", "weekday", "time_since_previous", "time_since", "is_weekend"],
                                     cutoff_time=cutoff_times,
                                     cutoff_time_in_index=True,
                                     training_window=ft.Timedelta(180, "d"),  # only use the 180 days before each cutoff
                                     n_jobs=8, verbose=True)

score 1 · Accepted Answer · answered Oct 28 '19 at 15:54

How you decide the cutoff times for your training data depends on the answers to your two questions:

How long should the training window be: 1 month, 6 months, etc.?

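I don't think there is a single right length. You can treat the training window like a hyperparameter: try several sizes (for example 1, 3, and 6 months) and keep the one that gives the best results with the model on held-out data.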
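To make the effect of the window concrete, here is a minimal sketch in plain Python (not featuretools itself) of what a training window does to the data available at a cutoff time. The event timestamps are made up, and the boundary handling (start exclusive, cutoff inclusive) is an assumption for illustration.

```python
from datetime import datetime, timedelta

def events_in_window(event_times, cutoff, window):
    """Keep events in (cutoff - window, cutoff]; the boundary choice is an assumption."""
    start = cutoff - window
    return [t for t in event_times if start < t <= cutoff]

# Hypothetical read events for a single user.
reads = [
    datetime(2019, 1, 15),
    datetime(2019, 5, 15),
    datetime(2019, 9, 20),
    datetime(2019, 10, 10),
]
cutoff = datetime(2019, 10, 28)

# Count how many events each candidate window would leave for feature calculation.
counts = {}
for days in (30, 90, 180):
    counts[days] = len(events_in_window(reads, cutoff, timedelta(days=days)))

print(counts)  # {30: 1, 90: 2, 180: 3}
```

A shorter window leaves fewer events per user, so features such as "count" or "mean" are computed on less history. Comparing model scores across a few window sizes shows whether the older history helps.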

Given that user activity may differ before and after subscribing, I should cut off data for current subscribers at the time they subscribed (to prevent leakage). But when should I cut off data for non-subscribers?

I think you can pick the cutoff times for non-subscribers randomly, or at times that are representative of when you're going to use the model on those users in the future.
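As a rough sketch of the random option, the function below builds one cutoff row per user: subscribers get their subscription time, non-subscribers get a random time inside an observation period. The function name, the `user_id`, `time`, and `label` column names, and the observation period are all assumptions for illustration, not part of featuretools.

```python
import random
from datetime import datetime, timedelta

def build_cutoff_rows(subscribed_at, observation_start, observation_end, seed=0):
    """Return one {"user_id", "time", "label"} row per user.

    subscribed_at maps user_id -> subscription datetime, or None for non-subscribers.
    """
    rng = random.Random(seed)  # seeded so the sampled cutoffs are reproducible
    span_seconds = int((observation_end - observation_start).total_seconds())
    rows = []
    for user_id, sub_time in subscribed_at.items():
        if sub_time is not None:
            # Subscriber: cut off at the subscription time to avoid leakage.
            cutoff, label = sub_time, True
        else:
            # Non-subscriber: sample a cutoff uniformly within the observation period.
            cutoff = observation_start + timedelta(seconds=rng.randint(0, span_seconds))
            label = False
        rows.append({"user_id": user_id, "time": cutoff, "label": label})
    return rows

# Hypothetical users: "a" subscribed, "b" did not.
rows = build_cutoff_rows(
    {"a": datetime(2019, 3, 1), "b": None},
    observation_start=datetime(2019, 1, 1),
    observation_end=datetime(2019, 6, 30),
)
```

Wrapping the result in `pandas.DataFrame(rows)` gives a table that could be passed as `cutoff_time`, assuming `user_id` matches the index of the `users` entity. If the subscription event itself is stored in the entity set, it may be safer to set the subscriber cutoff slightly before the subscription time.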

Our open source library Compose is ideal for structuring this labeling process. If you define your prediction problem in Compose, it will automatically select the negative examples based on how you define the prediction problem. It also has a parameterized prediction window to let you generate labels at specific times. Let me know if this helps.

Thanks Jeff - I just want to express my gratitude to you and your team for the great work at Feature Labs! Amazing work on featuretools and providing great support here as well. I'll definitely check out Compose, hoping to see it as a native feature in featuretools! — Ivan, Oct 30 '19 at 03:46
Thanks Ivan, we appreciate the kind words and we are happy to help! — Jeff Hernandez, Oct 30 '19 at 14:38
