I am running featuretools to create new features, having built the entityset from an existing dataframe.
The training dataframe has ~233K records and 81 columns; I split it into 3 entities and pass the resulting entityset to `ft.dfs`, which takes about 2.5 hours on the training data and 1.5 hours on the test data (~120K records, 80 columns).
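For reference, here is a minimal sketch of my setup (the entity and index names below are placeholders, not my actual schema):

```python
import featuretools as ft

# EntitySet built from the training dataframe; names are placeholders
es = ft.EntitySet(id="train")
es = es.entity_from_dataframe(entity_id="main",
                              dataframe=train_df,
                              index="record_id")
# ...two more entities are split out with es.normalize_entity(...)

# The slow step: ~2.5 hours on the training data
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="main")
```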
How can I reduce the execution time? I am running the code in a Kaggle kernel, and the `ft.dfs` call alone consumes more than 4 of the 9 hours available in a session.
I have read the featuretools documentation on parallel processing and on improving performance, but it is not clear to me how to apply it when the entities are created from a dataframe, or maybe I am misunderstanding it; my attempt is sketched below.
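From my reading of the docs, parallelism is controlled by arguments to `ft.dfs` itself, regardless of how the entityset was built, so I tried something like this (the parameter values are guesses on my part):

```python
# Attempted parallel run, based on the parallel-processing docs
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="main",
    max_depth=1,      # shallower feature stacking to cut runtime
    n_jobs=2,         # number of dask workers; Kaggle kernels expose few cores
    chunk_size=0.05,  # compute this fraction of rows per chunk
    verbose=True,
)
```

Is this the right approach, or does the entityset need to be handled differently for parallel calculation?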
Desired outcome: reduce the execution time to roughly a quarter of what it currently takes.