I tried using featuretools with timestamps to use the past decisions of distributors as a predictive variable. I have only one dataset as input, a typical binary classification problem (with n rows). There is a set of distributors (<
It is very important to respect the timestamp ordering when calculating the mean label associated with a distributor at each timestamp, to avoid information leakage (i.e. the mean computed for a row must only use labels from that distributor's strictly earlier rows).
Here is how I would do it with Pandas:
import pandas as pd
import numpy as np
from datetime import datetime
import featuretools as ft
timestamps = ['2019-01-05-10:36:12', '2019-01-04-11:32:12', '2019-01-03-08:01:03', '2019-01-03-06:32:54',
              '2019-01-01-07:30:24', '2018-12-20-04:20:25']
time = [datetime.strptime(x, '%Y-%m-%d-%H:%M:%S') for x in timestamps]
data = {'time': time,
        'Distributor': ['A', 'B', 'A', 'B', 'B', 'B'],
        'Label': [1, 0, 0, 0, 0, 1]}
# Create DataFrame
df = pd.DataFrame(data)
df = df.sort_values(['Distributor','time'])
def past70(g):
    # Resample to daily frequency so the rolling window is expressed in days
    g = g.set_index('time').resample('D').last()
    # Mean label over the past 70 days, shifted by one day so the current label is excluded
    g['Past_average_label_per_distributor'] = g['Label'].rolling(70, min_periods=0).mean().shift(1)
    # Drop the filler days introduced by the resampling
    return g[g.Label.notnull()]

df = df.groupby('Distributor').apply(past70)
df
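As a sanity check that no future information leaks in: for distributor B at 2019-01-03, the past average should only use the two earlier B labels (1 and 0), i.e. 0.5. Assuming the groupby/apply result keeps a (Distributor, date) MultiIndex, this can be checked with:
# expected: 0.5 (mean of B's labels from 2018-12-20 and 2019-01-01 only)
print(df.loc[('B', pd.Timestamp('2019-01-03')), 'Past_average_label_per_distributor'])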
Doing this with pandas quickly becomes tedious, as I would like to apply many primitives to my problem (say I also want the standard deviation of past labels per distributor, plus many other variables grouped by distributor and calculated over a time window).
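For example, extending the approach above to also compute a rolling standard deviation would look roughly like this (the past_stats name and the Past_std_label_per_distributor column are only illustrative; this would be applied to the raw frame, before the groupby/apply above):
# Sketch: same pattern as past70, but every extra statistic needs its own hand-written call
def past_stats(g):
    g = g.set_index('time').resample('D').last()
    past_labels = g['Label'].rolling(70, min_periods=0)
    # Shift by one day so the current label never leaks into its own features
    g['Past_average_label_per_distributor'] = past_labels.mean().shift(1)
    g['Past_std_label_per_distributor'] = past_labels.std().shift(1)
    return g[g.Label.notnull()]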
Here is a failed attempt with featuretools:
import pandas as pd
import numpy as np
from datetime import datetime
import featuretools as ft
timestamps = ['2019-01-05-10:36:12', '2019-01-04-11:32:12', '2019-01-03-08:01:03', '2019-01-03-06:32:54',
              '2019-01-01-07:30:24', '2018-12-20-04:20:25']
time = [datetime.strptime(x, '%Y-%m-%d-%H:%M:%S') for x in timestamps]
data = {'time': time,
        'Distributor': ['A', 'B', 'A', 'B', 'B', 'B'],
        'Label': [1, 0, 0, 0, 0, 1]}
# Create DataFrame
df = pd.DataFrame(data)
df = df.sort_values(['Distributor','time'])
# One cutoff time per row: features should be computed using only data strictly before it
cutoff_times = pd.DataFrame({
    "index": df.index,
    "cutoff_time": df['time']
})
es = ft.EntitySet(id='Sales')
es.entity_from_dataframe(entity_id='Sales', dataframe=df, index='index', make_index=True, time_index='time')
# Normalized entity so aggregations can be grouped by distributor
es = es.normalize_entity(base_entity_id='Sales', new_entity_id='Distributors', index='Distributor')
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='Sales',
                                      cutoff_time=cutoff_times,
                                      where_primitives=['mean'], features_only=False,
                                      cutoff_time_in_index=False)
feature_matrix # not correct
Would anyone have a lead on how to achieve this? I can't seem to find anything similar in the documentation, yet it seems to be a pretty common thing in machine learning pre-processing.