
I love featuretools, but I'm having a hard time working it into my data science workflow because I'm concerned about data leakage.

I think the way to prevent this would be to run deep feature synthesis on the training set, then join the resulting values to the test set, calculating features only for groups of categories that don't exist in the training set.

Is there a more appropriate way of dealing with leakage?

1 Answer


Featuretools is particularly focused on helping users avoid data leakage (also called label leakage). There are two ways to deal with it, depending on whether or not your data has timestamps.

Data without timestamps

In the case where you don’t have timestamps, you can create one EntitySet using only the training data and then run ft.dfs. This will create a feature matrix using only the training data and also return a list of feature definitions. Next, you can create an EntitySet using the test data and recalculate the same features by calling ft.calculate_feature_matrix with the list of feature definitions from before. Here is what that flow looks like:

In [1]: import featuretools as ft

In [2]: es_train = ft.demo.load_mock_customer(return_entityset=True)

In [3]: feature_matrix, feature_defs = ft.dfs(entityset=es_train,
   ...:                                       target_entity="customers")
   ...: 

In [4]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)

In [5]: feature_matrix_enc
Out[5]: 
             zip_code = 02139  zip_code = 60091  zip_code = unknown  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount)  MODE(sessions.device) = desktop  MODE(sessions.device) = tablet  MODE(sessions.device) = mobile  MODE(sessions.device) = unknown                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                                                             ...                                                                                                                                                                                                                                                                                                                                                                                                                                          
1                           0                 1                   0                  131               10                  10236.77                                1                               0                               0                                0                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358
2                           1                 0                   0                  122                8                   9118.81                                0                               0                               1                                0                   ...                                                     114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007
3                           1                 0                   0                   78                5                   5758.24                                1                               0                               0                                0                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425
4                           0                 1                   0                  111                8                   8205.28                                1                               0                               0                                0                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169
5                           1                 0                   0                   58                4                   4571.37                                0                               1                               0                                0                   ...                                                      73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571

[5 rows x 102 columns]

In [6]: es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)

In [7]: feature_matrix = ft.calculate_feature_matrix(features=features_enc, 
   ...:                                              entityset=es_test)

In [8]: feature_matrix
Out[8]: 
             zip_code = 02139  zip_code = 60091  zip_code = unknown  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount)  MODE(sessions.device) = desktop  MODE(sessions.device) = tablet  MODE(sessions.device) = mobile  MODE(sessions.device) = unknown                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                                                             ...                                                                                                                                                                                                                                                                                                                                                                                                                                          
1                       False              True               False                  108                7                   8298.18                            False                           False                            True                            False                   ...                                                     145.67                                 0.888409                                   40.48                               541.452307                              264.820242                                11.560551                                 -0.989418                               11.336633                                        1                                -0.193705
2                        True             False               False                   73                5                   5615.36                             True                           False                           False                            False                   ...                                                     106.27                                 0.471924                                   34.93                               380.553253                              420.418805                                 3.513896                                  1.030220                                7.908124                                        1                                -0.191482
3                       False              True               False                   96                7                   8135.65                            False                            True                           False                            False                   ...                                                     160.04                                 0.114599                                   48.71                               581.583008                              377.210618                                12.120119                                  0.130497                               12.869592                                        1                                -0.655836
4                       False              True               False                  140                9                  11240.85                             True                           False                           False                            False                   ...                                                     159.64                                 0.129480                                   29.87                               731.382339                              211.918894                                11.642241                                 -0.271928                                7.969242                                        1                                -0.652966
5                       False              True               False                   83                7                   6781.33                            False                           False                            True                            False                   ...                                                     149.95                                 0.587567                                   60.29                               527.818923                              535.839994                                19.134789                                 -1.195453                               26.460616                                        1                                -0.435026

[5 rows x 102 columns]
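
From here, modeling proceeds as usual: train on the training matrix and predict on the test matrix. A minimal sketch with scikit-learn, continuing the session above (the y_train labels are dummy values for illustration; any estimator would do):

In [9]: from sklearn.ensemble import RandomForestClassifier

In [10]: y_train = [0, 1, 0, 1, 0]  # dummy labels, one per training customer

In [11]: X_train = feature_matrix_enc.fillna(0)  # encoded training matrix from ft.encode_features

In [12]: X_test = feature_matrix.fillna(0)[X_train.columns]  # test matrix, same columns in the same order

In [13]: clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

In [14]: predictions = clf.predict(X_test)

Because the encoded feature definitions come from the training data only, categories that appear only in the test data fall into the "unknown" columns rather than leaking new information into training.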

Data with timestamps

If your data has timestamps, the best way to prevent leakage is to use a list of “cutoff times”, which specify the last point in time at which data is allowed to be used for each row in the resulting feature matrix. To use cutoff times, you need to set a time index for each time-sensitive entity in your entity set.

Tip: Even if your data doesn’t have timestamps, you can add a column of dummy timestamps that Featuretools can use as a time index.
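
For example, here is a minimal sketch of adding a dummy timestamp to a hypothetical transactions table and registering it as the time index (using the same pre-1.0 Featuretools API as the rest of this answer):

import pandas as pd
import featuretools as ft

# Hypothetical transactions table with no real timestamps
transactions_df = pd.DataFrame({"transaction_id": [1, 2, 3],
                                "customer_id": [1, 1, 2],
                                "amount": [12.50, 40.00, 7.25]})

# Add a constant dummy timestamp column
transactions_df["time"] = pd.Timestamp("2014-01-01")

# Register the column as the entity's time index
es = ft.EntitySet(id="customers")
es = es.entity_from_dataframe(entity_id="transactions",
                              dataframe=transactions_df,
                              index="transaction_id",
                              time_index="time")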

When you call ft.dfs, you can provide a dataframe of cutoff times like this:

In [1]: import featuretools as ft

In [2]: import pandas as pd

In [3]: cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
   ...:                             "time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
   ...: 

In [4]: cutoff_times
Out[4]: 
   customer_id                time
0            1 2014-01-01 01:41:50
1            2 2014-01-01 02:06:50
2            3 2014-01-01 02:31:50
3            4 2014-01-01 02:56:50
4            5 2014-01-01 03:21:50

In [5]: es = ft.demo.load_mock_customer(return_entityset=True)

In [6]: feature_matrix, features = ft.dfs(entityset=es,
   ...:                                  target_entity="customers",
   ...:                                  cutoff_time=cutoff_times,
   ...:                                  cutoff_time_in_index=True)
   ...: 

In [7]: feature_matrix
Out[7]: 
                                zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id time                                                                                                                                                                                                                                                 ...                                                                                                                                                                                                                                                                                                                                                                                                                                          
1           2014-01-01 01:41:50    60091                   43                3                   3342.76               desktop                      5.60                    148.14             2008                  -0.024647               1                   ...                                                      22.99                                 0.219871                                    8.72                               238.078662                              155.824474                                 7.762885                              2.850032e-01                                5.224602                                      1.0                                -0.395358
2           2014-01-01 02:06:50    02139                   36                3                   2558.77               desktop                      6.29                    139.23             2008                   0.212373              20                   ...                                                      39.00                                 0.509707                                   25.28                               213.211299                              114.675523                                 4.898920                             -2.392117e-02                                7.035723                                      1.0                                 0.102851
3           2014-01-01 02:31:50    02139                   25                1                   2054.32                mobile                      8.70                    147.73             2008                  -0.215072              10                   ...                                                       8.70                                -0.215072                                    8.70                                82.172800                                0.000000                                 0.000000                              0.000000e+00                                0.000000                                      1.0                                -0.215072
4           2014-01-01 02:56:50    60091                    0                0                       NaN                   NaN                       NaN                       NaN             2008                        NaN              30                   ...                                                        NaN                                      NaN                                     NaN                                      NaN                                     NaN                                      NaN                                       NaN                                     NaN                                      NaN                                      NaN
5           2014-01-01 03:21:50    02139                   29                2                   2296.42                mobile                     20.91                    141.66             2008                   0.167792              19                   ...                                                      48.37                                 0.830112                                   27.46                               157.570000                              208.390000                                11.655000                              1.795202e-15                                4.470000                                      1.0                                -0.396571

[5 rows x 69 columns]

As you can see, there is one row in the resulting feature matrix that was calculated at each specified cutoff time! The concepts of cutoff times and time indices are unique and powerful aspects of Featuretools. For more information, read Handling Time in the documentation.
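
Cutoff times also give you a clean way to express the train/test split itself: assign training rows cutoff times before your split point and test rows cutoff times after it, and each row's features are computed only from data before that row's own cutoff, so nothing leaks across the split. A sketch with hypothetical ids and times:

import pandas as pd

# Hypothetical split point: training labels observed before 02:00,
# test labels observed after it
train_cutoffs = pd.DataFrame({"customer_id": [1, 2],
                              "time": pd.to_datetime(["2014-01-01 01:40:00",
                                                      "2014-01-01 01:50:00"])})
test_cutoffs = pd.DataFrame({"customer_id": [3, 4, 5],
                             "time": pd.to_datetime(["2014-01-01 02:30:00",
                                                     "2014-01-01 03:00:00",
                                                     "2014-01-01 03:20:00"])})

# A single dfs call can compute both sets; split the resulting
# feature matrix back into train and test rows by cutoff time
cutoff_times = pd.concat([train_cutoffs, test_cutoffs], ignore_index=True)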

Gaurav
Max Kanter
  • I have a question regarding "percentile" variables. I have some historical data (20k customers) and use it as the training set. When it comes to prediction for new data (i.e., a future customer), how can I get the percentile for that person? Thanks – Chau Pham Jul 02 '19 at 04:21
  • Could you clarify whether, in the second case (the one with cutoff times), both test data and train data are used at once to create the features? – Arun Mar 17 '20 at 10:13