I am trying to reproduce the featuretools tutorial (see link below), using the mock data provided in the package. It includes a customers table and a sessions table. Every customer has many sessions, and every session has a session_start timestamp. I compute the mean of the primitive time_since_previous applied to session_start in two ways: a) using featuretools and b) manually. The two results differ. Where am I going wrong?

a) Calculation using featuretools:

import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, features_defs = ft.dfs(
    entityset=es,
    target_entity='customers',
    agg_primitives=['mean'],
    trans_primitives=['time_since_previous'])

The MEAN(sessions.TIME_SINCE_PREVIOUS(session_start)) for customer 3 is 888.333333

b) Manual calculation:

time_since_previous(sessions[sessions.customer_id == 3].session_start).tolist()
[nan, 10075.0, 3900.0, 1625.0, 8710.0, 1170.0]
statistics.mean([ 10075.0, 3900.0, 1625.0, 8710.0, 1170.0])
5096.0

https://docs.featuretools.com/en/stable/automated_feature_engineering/primitives.html

1 Answer

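When time_since_previous is passed in trans_primitives, it is applied to the sessions table as a whole rather than per customer, so each value is the time since the previous session of any customer. That is why the mean differs from your manual calculation. To apply time_since_previous for each customer, you can use groupby_trans_primitives in DFS.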

fm, fd = ft.dfs(
    entityset=es,
    target_entity='customers',
    agg_primitives=['mean'],
    groupby_trans_primitives=['time_since_previous'],
)

fm.filter(regex='sessions.TIME_SINCE_PREVIOUS')
             MEAN(sessions.TIME_SINCE_PREVIOUS(session_start) by customer_id)
customer_id                                                                  
5                                                  5577.000000               
4                                                  2516.428571               
1                                                  3305.714286               
3                                                  5096.000000               
2                                                  4907.500000               
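
To illustrate why grouping changes the result, here is a minimal sketch in plain Python. It does not use featuretools or the mock customer data: the toy sessions and the helper names seconds_since_previous and mean_by_customer are made up for this example. It compares gaps computed over all sessions in time order with gaps computed within each customer.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical toy data (NOT the featuretools mock dataset):
# sessions of two customers, interleaved in time.
start = datetime(2014, 1, 1)
sessions = [
    (1, start),
    (2, start + timedelta(seconds=100)),
    (1, start + timedelta(seconds=300)),
    (2, start + timedelta(seconds=600)),
    (1, start + timedelta(seconds=1000)),
]


def seconds_since_previous(timestamps):
    """Gap in seconds to the preceding timestamp; None for the first one."""
    gaps = [None]
    for prev, cur in zip(timestamps, timestamps[1:]):
        gaps.append((cur - prev).total_seconds())
    return gaps


def mean_by_customer(customer_ids, gaps):
    """Average the non-missing gaps belonging to each customer."""
    buckets = defaultdict(list)
    for cid, gap in zip(customer_ids, gaps):
        if gap is not None:
            buckets[cid].append(gap)
    return {cid: mean(vals) for cid, vals in buckets.items()}


ids = [cid for cid, _ in sessions]
times = [t for _, t in sessions]

# Ungrouped: each gap is measured against the previous session of ANY customer.
ungrouped = mean_by_customer(ids, seconds_since_previous(times))

# Grouped: each gap is measured against the same customer's previous session.
per_customer = defaultdict(list)
for cid, t in sessions:
    per_customer[cid].append(t)
grouped = {
    cid: mean(g for g in seconds_since_previous(ts) if g is not None)
    for cid, ts in per_customer.items()
}

print(ungrouped)  # {1: 300.0, 2: 200.0}
print(grouped)    # {1: 500.0, 2: 500.0}
```

The ungrouped means are smaller because sessions from other customers fall between a customer's own sessions, which is the same pattern as 888.33 versus 5096.0 in the question.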

Let me know if this helps.

Jeff Hernandez