I am creating a featuretools feature matrix from 5 dataframe entities and one cutoff_time table. When I call ft.dfs(), I pass both agg_primitives and trans_primitives; however, none of the primitives in trans_primitives that apply to a datetime column generate any features.

The entity that includes the datetime column is called 'events'. The name of the column is 'event-timestamp'.

Since my trans_primitives list includes other primitives that do generate features ("IS_NULL" works), I assume that the problem is not with how I pass trans_primitives as a whole, but only with the primitives that relate to time.

Some things that may assist:

  1. I checked the dtype of the column 'event-timestamp' in 'events' and it is datetime64[ns]. The same holds for the 'cutoff_time' column in the cutoff table.

  2. Another detail is that some new features of 'event-timestamp' are generated by the agg_primitives (for example 'MIN(matcher.devices.TIME_SINCE_LAST(events.event-timestamp))'), so I guess it shows that the column itself is OK.

  3. I did some experiments with the es.entity_from_dataframe of 'events':

    • Used the argument: time_index='event-timestamp'
    • Used the argument: variable_types={'event-timestamp': vtypes.Datetime}
    • Used both of the above and none of them

Below is the function that I am using:

import featuretools as ft


def generate_feature_matrix(events, grns, contracts, om_table, matcher, customers):
    """
    Take a set of tables, create featuretools entities and relationships,
    and return the resulting feature definitions.
    """


    ## Make empty entityset
    es = ft.EntitySet(id = 'contracts_customers')


    ## Create entities
    # events
    es.entity_from_dataframe(entity_id='events', dataframe=events, index='index', make_index=True,
                             time_index='event-timestamp') # tried also variable_types={'event-timestamp': vtypes.DatetimeTimeIndex} 
    # contracts
    es.entity_from_dataframe(entity_id='contracts', dataframe=contracts, index='contract')
    # Matcher
    es.entity_from_dataframe(entity_id='matcher', dataframe=matcher, index = 'contract', 
                             make_index=False)
    # om_table
    es.entity_from_dataframe(entity_id='om_table', dataframe=om_table, index='index', 
                             make_index=True)
    # customers
    es.entity_from_dataframe(entity_id='customers', dataframe=customers, index='customer')




    # Relationships (parent, child)
    r_devices_matcher = ft.Relationship(es['contracts']['contract'], es['matcher']['contract'])
    r_devices_events = ft.Relationship(es['contracts']['contract'], es['events']['contract'])
    r_devices_os = ft.Relationship(es['contracts']['contract'], es['om_table']['contract'])
    r_users_matcher = ft.Relationship(es['customers']['customer'], es['matcher']['customer'])

    es.add_relationships([r_devices_matcher, r_devices_events, r_users_matcher, r_devices_os])

    # Primitives
    agg_primitives=["num_unique", "skew", "mean", "count", "median", "sum",
                    "time_since_last", "mode", "min"] 

    trans_primitives=['month', 'weekday','hour', "time_since", "time_since_previous",
                      'is_null']

    # Generate the features
    feature_defs = ft.dfs(entityset=es, target_entity='customers',
                          cutoff_time=grns,
                          agg_primitives=agg_primitives,
                          trans_primitives=trans_primitives,
                          max_depth=3, features_only=True,
                          chunk_size=len(grns))



    return feature_defs

The EntitySet looks like this:

os
Out[392]: 
Entityset: contracts_customers
  Entities:
    events [Rows: 22, Columns: 3]
    contracts [Rows: 35, Columns: 2]
    matcher [Rows: 2663, Columns: 2]
    om_table [Rows: 965, Columns: 4]
    customers [Rows: 76, Columns: 2]
  Relationships:
    matcher.contract -> contracts.contract
    events.contract -> contracts.contract
    matcher.customer -> customers.customer
    om_table.contract -> contracts.contract

And the generated feature list:

new_features
Out[393]: 
[<Feature: n_contracts>,
 <Feature: NUM_UNIQUE(matcher.contract)>,
 <Feature: MODE(matcher.contract)>,
 <Feature: IS_NULL(customer)>,
 <Feature: IS_NULL(n_contracts)>,
 <Feature: SKEW(matcher.contracts.n_event)>,
 <Feature: MEAN(matcher.contracts.n_event)>,
 <Feature: MEDIAN(matcher.contracts.n_event)>,
 <Feature: SUM(matcher.contracts.n_event)>,
 <Feature: MIN(matcher.contracts.n_event)>,
 <Feature: IS_NULL(NUM_UNIQUE(matcher.contract))>,
 <Feature: IS_NULL(MODE(matcher.contract))>,
 <Feature: NUM_UNIQUE(matcher.contracts.MODE(matcher.customer))>,
 <Feature: NUM_UNIQUE(matcher.contracts.MODE(om_table.om_family))>,
 <Feature: SKEW(matcher.contracts.COUNT(events))>,
 <Feature: SKEW(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: SKEW(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: SKEW(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: SKEW(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: SKEW(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: SKEW(matcher.contracts.COUNT(om_table))>,
 <Feature: SKEW(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: SKEW(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: SKEW(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.COUNT(events))>,
 <Feature: MEAN(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: MEAN(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: MEAN(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: MEAN(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.COUNT(om_table))>,
 <Feature: MEAN(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.COUNT(events))>,
 <Feature: MEDIAN(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: MEDIAN(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: MEDIAN(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: MEDIAN(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.COUNT(om_table))>,
 <Feature: MEDIAN(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.COUNT(events))>,
 <Feature: SUM(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: SUM(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: SUM(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: SUM(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.COUNT(om_table))>,
 <Feature: SUM(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: MODE(matcher.contracts.MODE(matcher.customer))>,
 <Feature: MODE(matcher.contracts.MODE(om_table.om_family))>,
 <Feature: MIN(matcher.contracts.COUNT(events))>,
 <Feature: MIN(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: MIN(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: MIN(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: MIN(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: MIN(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: MIN(matcher.contracts.COUNT(om_table))>,
 <Feature: MIN(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: MIN(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: MIN(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: IS_NULL(SKEW(matcher.contracts.n_event))>,
 <Feature: IS_NULL(MEAN(matcher.contracts.n_event))>,
 <Feature: IS_NULL(MEDIAN(matcher.contracts.n_event))>,
 <Feature: IS_NULL(SUM(matcher.contracts.n_event))>,
 <Feature: IS_NULL(MIN(matcher.contracts.n_event))>]

I expect to get new features from every primitive in the trans_primitives list above.

IshayT
  • Can you run `es.plot()` and upload a picture of your schema? Also, can you share the list of features that is being generated? – Max Kanter Jun 25 '19 at 14:36
  • Unfortunately I still do not have enough reputation to upload images, but I added the EntitySet printout and the list of generated features. – IshayT Jun 26 '19 at 10:19
  • I managed to solve the issue. I just increased the max_depth to 4 instead of 3 and I got all the desired variables. It seems that my understanding about the max_depth was not complete. – IshayT Jul 08 '19 at 07:28
  • @IshayT I was struggling a lot with this same issue. thanks for sharing your answer. – Mahsa Seifikar Nov 11 '19 at 12:37

1 Answer

What variable type does es.plot() show for the 'event-timestamp' column? Given what you said about 'time_since_last', though, I doubt this is the issue.

Also, when you change the target entity from 'customers' to 'events', does the problem still persist? It's hard to tell exactly without seeing the schema, but I'm guessing that 'events' and 'customers' are not related in a way within the EntitySet such that the primitives are calculating the features you want. Try changing the target entity and looking at the features created. If there still aren't any of the datetime trans_primitives, then it's a different problem than what I'm thinking.

EDIT: Replicated similar behavior:

import featuretools as ft
from featuretools.tests.testing_utils import make_ecommerce_entityset

es = make_ecommerce_entityset()
es.plot()

features = ft.dfs(entityset=es,
                  target_entity="stores",
                  features_only=True,
                  max_depth=3)

features

The features related to "cohorts" are:

<Feature: régions.MODE(customers.cohorts.cohort_name)>
<Feature: régions.NUM_UNIQUE(customers.cohorts.cohort_name)>,

Notice that here, the primitives are not being applied to the values of cohorts to generate new features either.

I think what's happening is that events and customers are too indirectly related: customers and contracts share the child entity matcher, while events is a child of contracts. In the example above, DFS likewise does not calculate new features for entities that are this far apart.

I believe the defined behavior is to apply the primitives to the target entity and its immediately related entities. Here, because the entities are too indirectly related (in the example above, features for sessions are not calculated either, nor for cohorts), the primitives are not applied to their values until you increase max_depth.
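To make the depth argument concrete, here is a rough sketch in plain Python. It is not featuretools' actual algorithm, only a simplified model that assumes every relationship hop between the target entity and the source entity costs one level of depth (an aggregation or a direct feature), and every transform primitive stacked on top costs one more. The relationships are copied from the EntitySet printout in the question; the helper names hops_between and required_depth are made up for illustration.

```python
from collections import deque

# (child, parent) pairs copied from the EntitySet printout in the question.
RELATIONSHIPS = [
    ("matcher", "contracts"),
    ("events", "contracts"),
    ("matcher", "customers"),
    ("om_table", "contracts"),
]


def hops_between(target, source, relationships=RELATIONSHIPS):
    """Count relationship hops from target to source with a breadth-first search.

    Hypothetical helper: relationships are treated as undirected, because a
    feature can travel child -> parent (aggregation) or parent -> child (direct).
    """
    neighbours = {}
    for child, parent in relationships:
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)

    queue = deque([(target, 0)])
    seen = {target}
    while queue:
        entity, dist = queue.popleft()
        if entity == source:
            return dist
        for nxt in neighbours.get(entity, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two entities are not connected


def required_depth(target, source, n_transforms):
    """Assumed model: one depth level per hop plus one per stacked transform."""
    hops = hops_between(target, source)
    return None if hops is None else hops + n_transforms


# A raw 'events' column aggregated up to 'customers': 3 hops, no transform.
print(required_depth("customers", "events", n_transforms=0))
# The same path with one transform (e.g. MONTH) applied on 'events' first.
print(required_depth("customers", "events", n_transforms=1))
```

Under this model, an aggregation such as TIME_SINCE_LAST on events reaches customers at depth 3, which matches the feature list in the question, while any datetime transform on events needs depth 4, which matches the fix reported by the asker.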

  • Thanks Alexander Wang. The variable type of 'event-timestamp' is datetime_time_index. I tried changing the target to 'events'; it did produce the trans_primitives features, but I lost other important features I had with 'customers' as the target. Another problem is that I couldn't produce the matrix with 'events' as the target, since the index column of 'events' is not in the cutoff table, and cannot be, since the cutoff is on the customer level, not the event level. – IshayT Jul 02 '19 at 06:40