Featuretools is particularly focused on helping users avoid data leakage or label leakage. There are two ways to deal with leakage, depending on whether or not your data has timestamps.
Data without timestamps
In the case where you don’t have timestamps, you can create one EntitySet using only the training data and then run ft.dfs. This creates a feature matrix from the training data alone and also returns a list of feature definitions. Next, you can create an EntitySet using the test data and recalculate the same features by calling ft.calculate_feature_matrix with the list of feature definitions from before. Here is what that flow looks like:
In [1]: import featuretools as ft
In [2]: es_train = ft.demo.load_mock_customer(return_entityset=True)
In [3]: feature_matrix, feature_defs = ft.dfs(entityset=es_train,
...: target_entity="customers")
...:
In [4]: feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)
In [5]: feature_matrix_enc
Out[5]:
zip_code = 02139 zip_code = 60091 zip_code = unknown COUNT(transactions) COUNT(sessions) SUM(transactions.amount) MODE(sessions.device) = desktop MODE(sessions.device) = tablet MODE(sessions.device) = mobile MODE(sessions.device) = unknown ... SUM(sessions.MIN(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.MEAN(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) STD(sessions.MAX(transactions.amount)) NUM_UNIQUE(sessions.DAY(session_start)) MIN(sessions.SKEW(transactions.amount))
customer_id ...
1 0 1 0 131 10 10236.77 1 0 0 0 ... 169.77 0.610052 41.95 791.976505 175.939423 9.299023 -0.377150 5.857976 1 -0.395358
2 1 0 0 122 8 9118.81 0 0 1 0 ... 114.85 0.492531 42.96 596.243506 230.333502 10.925037 0.962350 7.420480 1 -0.470007
3 1 0 0 78 5 5758.24 1 0 0 0 ... 64.98 0.645728 21.77 369.770121 471.048551 9.819148 -0.244976 12.537259 1 -0.630425
4 0 1 0 111 8 8205.28 1 0 0 0 ... 83.53 0.516262 17.27 584.673126 322.883448 13.065436 -0.548969 12.738488 1 -0.497169
5 1 0 0 58 4 4571.37 0 1 0 0 ... 73.09 0.830112 27.46 313.448942 198.522508 8.950528 0.098885 5.599228 1 -0.396571
[5 rows x 102 columns]
In [6]: es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)
In [7]: feature_matrix = ft.calculate_feature_matrix(features=features_enc,
...: entityset=es_test)
In [8]: feature_matrix
Out[8]:
zip_code = 02139 zip_code = 60091 zip_code = unknown COUNT(transactions) COUNT(sessions) SUM(transactions.amount) MODE(sessions.device) = desktop MODE(sessions.device) = tablet MODE(sessions.device) = mobile MODE(sessions.device) = unknown ... SUM(sessions.MIN(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.MEAN(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) STD(sessions.MAX(transactions.amount)) NUM_UNIQUE(sessions.DAY(session_start)) MIN(sessions.SKEW(transactions.amount))
customer_id ...
1 False True False 108 7 8298.18 False False True False ... 145.67 0.888409 40.48 541.452307 264.820242 11.560551 -0.989418 11.336633 1 -0.193705
2 True False False 73 5 5615.36 True False False False ... 106.27 0.471924 34.93 380.553253 420.418805 3.513896 1.030220 7.908124 1 -0.191482
3 False True False 96 7 8135.65 False True False False ... 160.04 0.114599 48.71 581.583008 377.210618 12.120119 0.130497 12.869592 1 -0.655836
4 False True False 140 9 11240.85 True False False False ... 159.64 0.129480 29.87 731.382339 211.918894 11.642241 -0.271928 7.969242 1 -0.652966
5 False True False 83 7 6781.33 False False True False ... 149.95 0.587567 60.29 527.818923 535.839994 19.134789 -1.195453 26.460616 1 -0.435026
[5 rows x 102 columns]
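To illustrate why the feature definitions are built from the training data and then reused unchanged on the test data, here is a minimal conceptual sketch in plain Python. It does not use or reproduce the Featuretools API: the helper names learn_categories and one_hot and the toy zip codes are hypothetical, chosen only to loosely mirror the zip_code columns in the output above.

```python
# Conceptual sketch only: plain Python, not the Featuretools API.
# All names and values below are hypothetical illustrations.

def learn_categories(train_values, top_n=10):
    """Learn the one-hot categories from the training data only."""
    counts = {}
    for value in train_values:
        counts[value] = counts.get(value, 0) + 1
    # Most frequent first; ties broken by value for a stable order.
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    return ranked[:top_n]


def one_hot(values, categories):
    """Encode values with a fixed category list; unseen values map to 'unknown'."""
    columns = [f"zip_code = {c}" for c in categories] + ["zip_code = unknown"]
    rows = []
    for value in values:
        row = [int(value == c) for c in categories]
        row.append(int(value not in categories))
        rows.append(row)
    return columns, rows


train_zip = ["60091", "02139", "02139", "60091", "02139"]
test_zip = ["60091", "02139", "13244"]  # "13244" never appears in training

# The definitions (here, the category list) come from the training data only.
categories = learn_categories(train_zip)
train_cols, train_rows = one_hot(train_zip, categories)

# The same definitions are reused on the test data, so both matrices share
# identical columns and nothing about the test set shapes the features.
test_cols, test_rows = one_hot(test_zip, categories)
```

Because the category list is fixed before the test data is seen, a value that appears only in the test set falls into the unknown column instead of creating a new feature.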
Data with timestamps
If your data has timestamps, the best way to prevent leakage is to use a list of “cutoff times”, which specify, for each row in the resulting feature matrix, the last point in time from which data may be used. To use cutoff times, you need to set a time index for each time-sensitive entity in your entity set.
Tip: Even if your data doesn’t have timestamps, you can add a column of dummy timestamps for Featuretools to use as the time index.
When you call ft.dfs, you can provide a dataframe of cutoff times like this:
In [1]: import pandas as pd
In [2]: cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
...: "time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
...:
In [3]: cutoff_times
Out[3]:
customer_id time
0 1 2014-01-01 01:41:50
1 2 2014-01-01 02:06:50
2 3 2014-01-01 02:31:50
3 4 2014-01-01 02:56:50
4 5 2014-01-01 03:21:50
In [8]: feature_matrix, features = ft.dfs(entityset=es,
...: target_entity="customers",
...: cutoff_time=cutoff_times,
...: cutoff_time_in_index=True)
...:
In [9]: feature_matrix
Out[9]:
zip_code COUNT(transactions) COUNT(sessions) SUM(transactions.amount) MODE(sessions.device) MIN(transactions.amount) MAX(transactions.amount) YEAR(join_date) SKEW(transactions.amount) DAY(join_date) ... SUM(sessions.MIN(transactions.amount)) MAX(sessions.SKEW(transactions.amount)) MAX(sessions.MIN(transactions.amount)) SUM(sessions.MEAN(transactions.amount)) STD(sessions.SUM(transactions.amount)) STD(sessions.MEAN(transactions.amount)) SKEW(sessions.MEAN(transactions.amount)) STD(sessions.MAX(transactions.amount)) NUM_UNIQUE(sessions.DAY(session_start)) MIN(sessions.SKEW(transactions.amount))
customer_id time ...
1 2014-01-01 01:41:50 60091 43 3 3342.76 desktop 5.60 148.14 2008 -0.024647 1 ... 22.99 0.219871 8.72 238.078662 155.824474 7.762885 2.850032e-01 5.224602 1.0 -0.395358
2 2014-01-01 02:06:50 02139 36 3 2558.77 desktop 6.29 139.23 2008 0.212373 20 ... 39.00 0.509707 25.28 213.211299 114.675523 4.898920 -2.392117e-02 7.035723 1.0 0.102851
3 2014-01-01 02:31:50 02139 25 1 2054.32 mobile 8.70 147.73 2008 -0.215072 10 ... 8.70 -0.215072 8.70 82.172800 0.000000 0.000000 0.000000e+00 0.000000 1.0 -0.215072
4 2014-01-01 02:56:50 60091 0 0 NaN NaN NaN NaN 2008 NaN 30 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 2014-01-01 03:21:50 02139 29 2 2296.42 mobile 20.91 141.66 2008 0.167792 19 ... 48.37 0.830112 27.46 157.570000 208.390000 11.655000 1.795202e-15 4.470000 1.0 -0.396571
[5 rows x 69 columns]
As you can see, each row in the resulting feature matrix was calculated at its specified cutoff time. Cutoff times and time indices are powerful, distinctive aspects of Featuretools. For more information, read Handling Time in the documentation.
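To make the idea of a cutoff time concrete, the following is a minimal sketch in plain Python of what a per-row cutoff means. It is not how Featuretools implements cutoff times: the toy transactions, the helper features_at_cutoff, and the choice to include data recorded exactly at the cutoff are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical toy data: (customer_id, transaction_time, amount).
transactions = [
    (1, datetime(2014, 1, 1, 0, 10), 20.0),
    (1, datetime(2014, 1, 1, 1, 30), 35.0),
    (1, datetime(2014, 1, 1, 4, 0), 50.0),  # after customer 1's cutoff
    (2, datetime(2014, 1, 1, 3, 0), 12.5),  # after customer 2's cutoff
]

# One cutoff time per customer, like the rows of the cutoff_times dataframe.
cutoff_times = {
    1: datetime(2014, 1, 1, 1, 41, 50),
    2: datetime(2014, 1, 1, 2, 6, 50),
}


def features_at_cutoff(transactions, cutoff_times):
    """Aggregate each customer's transactions using only data up to the cutoff."""
    result = {}
    for customer_id, cutoff in cutoff_times.items():
        # Assumption: data recorded exactly at the cutoff is still allowed.
        amounts = [
            amount
            for cid, ts, amount in transactions
            if cid == customer_id and ts <= cutoff
        ]
        result[customer_id] = {
            "COUNT(transactions)": len(amounts),
            # No usable data yields a missing value instead of a sum.
            "SUM(transactions.amount)": sum(amounts) if amounts else None,
        }
    return result


features = features_at_cutoff(transactions, cutoff_times)
```

In this sketch, customer 2 has no transactions before the cutoff, so the count is 0 and the sum is missing, which is analogous to the row for customer 4 in the feature matrix above.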