I am building a model to predict whether a user will purchase a subscription based on their activity (read history, etc.). I am using featuretools
(https://www.featuretools.com/) to automate feature engineering, and this is where it gets tricky:
How should I decide the cutoff time and training window for my training data? Specifically:
- How long should the training window be: 1 month, 6 months, etc.?
- Given that user activity may differ before and after subscribing, I should cut off data for current subscribers at the time they subscribed (to prevent leakage). But when should I cut off data for non-subscribers?
My current call to ft.dfs:

    import featuretools as ft

    # es (the EntitySet) and cutoff_times are built earlier (not shown).
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="users",
        max_depth=2,
        agg_primitives=["sum", "std", "max", "min", "mean", "median", "count",
                        "percent_true", "num_unique", "mode", "avg_time_between"],
        trans_primitives=["day", "year", "month", "weekday", "time_since_previous",
                          "time_since", "is_weekend"],
        cutoff_time=cutoff_times,        # one cutoff per user
        cutoff_time_in_index=True,
        training_window=ft.Timedelta(180, "d"),  # only use the 180 days before each cutoff
        n_jobs=8,
        verbose=True,
    )
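
To make the second question concrete, here is a minimal sketch of the kind of cutoff table I have in mind. The `users` frame, its column names (`user_id`, `subscribed_at`), the example dates, and the single `snapshot` date for non-subscribers are all hypothetical placeholders, not my real data. The snapshot date is exactly the choice I am unsure about.

```python
import pandas as pd

# Hypothetical users table: subscribed_at is NaT for non-subscribers.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "subscribed_at": pd.to_datetime(["2019-03-10", None, "2019-05-02"]),
})

# Placeholder choice for non-subscribers: one shared snapshot date.
snapshot = pd.Timestamp("2019-06-30")

cutoff_times = pd.DataFrame({
    "user_id": users["user_id"],
    # Subscribers are cut off at their subscription time to avoid leakage;
    # non-subscribers fall back to the snapshot date.
    "time": users["subscribed_at"].fillna(snapshot),
    # Target label; as far as I understand, featuretools passes extra
    # cutoff-time columns through to the feature matrix.
    "label": users["subscribed_at"].notna(),
})

print(cutoff_times)
```

With a table like this, every user gets their own cutoff, and the training window is measured backwards from that cutoff rather than from a fixed calendar date.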