
Running both ft.dfs(...) and ft.calculate_feature_matrix(...) on a time series to extract the day, month, and year from a very small dataframe (<1k rows) takes about 800 ms. When I compute no features at all, it still takes about 750 ms. What is causing this overhead, and how can I reduce it?

I've tested different combinations of features, as well as running it on a number of small dataframes, and the execution time is pretty constant at 700-800 ms.

I've also tested it on much larger dataframes with >1 million rows. The execution time without any actual features (primitives) is comparable to that with all of the date features, at around 80-90 seconds. So it seems like the computation time depends on the number of rows but not on the features?

I'm running with n_jobs=1 to avoid any weirdness from parallelism. It seems to me like featuretools is doing some configuration or setup for the dask back-end on every call, and that is what is causing all of the overhead.

import featuretools as ft

# df_series is the raw time series: a date column plus a few flag columns
es = ft.EntitySet(id="testing")
es = es.entity_from_dataframe(
    entity_id="time_series",
    make_index=True,
    dataframe=df_series[[
        "date",
        "flag_1",
        "flag_2",
        "flag_3",
        "flag_4"
    ]],
    variable_types={},
    index="id",
    time_index="date"
)

print(len(df_series))

feature_matrix, features = ft.dfs(
    entityset=es,
    target_entity="time_series",
    agg_primitives=[],
    trans_primitives=[]
)
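
For reference, a rough way to separate the fixed per-call overhead from the per-row cost is to time the dfs call with no primitives versus with the date primitives, something like the sketch below (time_dfs is just a helper I made up, reusing the es defined above):

import time

import featuretools as ft

def time_dfs(entityset, trans_primitives):
    # Time a single ft.dfs call (feature definition + feature matrix calculation)
    start = time.perf_counter()
    ft.dfs(
        entityset=entityset,
        target_entity="time_series",
        agg_primitives=[],
        trans_primitives=trans_primitives,
        n_jobs=1
    )
    return time.perf_counter() - start

print("no primitives:   %.3f s" % time_dfs(es, []))
print("date primitives: %.3f s" % time_dfs(es, ["day", "month", "year"]))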

The actual output seems to be correct; I am just surprised that featuretools would take 800 ms to compute nothing on a small dataframe. Is the solution simply to avoid small dataframes and compute everything with a custom primitive on one large dataframe to mitigate the overhead? Or is there a smarter/more correct way of using ft.dfs(...) or ft.calculate_feature_matrix(...)?
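
For context, if batching everything into one large dataframe is indeed the way to go, I assume a custom transform primitive would look roughly like the sketch below (written against the pre-1.0 featuretools API that entity_from_dataframe implies; the DayOfMonth class and its name are made up for illustration):

from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Datetime, Numeric

class DayOfMonth(TransformPrimitive):
    # Hypothetical custom primitive that extracts the day of the month from a datetime column
    name = "day_of_month_custom"
    input_types = [Datetime]
    return_type = Numeric

    def get_function(self):
        def day_of_month(column):
            return column.dt.day
        return day_of_month

feature_matrix, features = ft.dfs(
    entityset=es,                    # the entity set built above
    target_entity="time_series",
    agg_primitives=[],
    trans_primitives=[DayOfMonth]
)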

Philliams

0 Answers