
I am trying to use Featuretools with timestamps so that the past decisions of distributors can serve as a predictive variable. I have a single input dataset for a typical binary classification problem (with n rows). There is a set of distributors (<

It is very important to respect the timestamp ordering when calculating the mean label associated with a distributor at each timestamp, to avoid information leakage.
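To make the leakage constraint concrete, here is a minimal sketch in plain Python. It uses the same toy data as the examples in this question and, for each row, averages only the labels the same distributor received strictly earlier. The helper name `past_mean_labels` is hypothetical, and the sketch ignores the 70-day window for simplicity.

```python
from collections import defaultdict
from datetime import datetime

# Same toy data as in the question: (timestamp, distributor, label).
rows = [
    ("2019-01-05-10:36:12", "A", 1),
    ("2019-01-04-11:32:12", "B", 0),
    ("2019-01-03-08:01:03", "A", 0),
    ("2019-01-03-06:32:54", "B", 0),
    ("2019-01-01-07:30:24", "B", 0),
    ("2018-12-20-04:20:25", "B", 1),
]


def past_mean_labels(rows):
    """Hypothetical helper: map (distributor, time) to the mean of strictly earlier labels."""
    parsed = [(datetime.strptime(t, "%Y-%m-%d-%H:%M:%S"), d, y) for t, d, y in rows]
    history = defaultdict(list)  # labels seen so far, per distributor
    out = {}
    for t, d, y in sorted(parsed):  # process rows in chronological order
        past = history[d]
        out[(d, t)] = sum(past) / len(past) if past else None  # None: no history yet
        history[d].append(y)  # the current label becomes visible only afterwards
    return out


past_means = past_mean_labels(rows)
for (d, t), m in sorted(past_means.items()):
    print(d, t, m)
```

The key point is the order of the two statements inside the loop: the feature is computed before the current label is appended to the history, so a row never sees its own label.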

Here is how I would do it with Pandas:

import pandas as pd
import numpy as np
from datetime import datetime
import featuretools as ft

timestamps = ['2019-01-05-10:36:12', '2019-01-04-11:32:12', '2019-01-03-08:01:03', '2019-01-03-06:32:54',
                '2019-01-01-07:30:24', '2018-12-20-04:20:25']

time = [datetime.strptime(x,'%Y-%m-%d-%H:%M:%S') for x in timestamps]

data = {'time': time,
        'Distributor': ['A','B','A','B','B','B'],
        'Label': [1, 0, 0, 0, 0, 1]}

# Create DataFrame
df = pd.DataFrame(data)
df = df.sort_values(['Distributor','time'])

def past70(g):
    g = g.set_index('time').resample('D').last()
    g['Past_average_label_per_distributor'] = g['Label'].rolling(70, 0).mean().shift(1)
    return g[g.Label.notnull()]

df = df.groupby('Distributor').apply(past70)
df

Doing this with pandas is tedious, however, because I would like to apply many primitives to my problem (say, the standard deviation of past labels per distributor, as well as many other variables grouped by distributor and calculated over a time window).
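For illustration, here is a rough pandas sketch of what "several statistics over the same past window" could look like. It is not equivalent to the `past70` function above: it assumes a time-based rolling window with `closed="left"`, which works on exact timestamps rather than on daily resampled values. The column names `past_mean` and `past_std` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(
        ["2019-01-05-10:36:12", "2019-01-04-11:32:12", "2019-01-03-08:01:03",
         "2019-01-03-06:32:54", "2019-01-01-07:30:24", "2018-12-20-04:20:25"],
        format="%Y-%m-%d-%H:%M:%S",
    ),
    "Distributor": ["A", "B", "A", "B", "B", "B"],
    "Label": [1, 0, 0, 0, 0, 1],
})

# Time-based rolling needs a monotonic datetime index, so sort by time first.
indexed = df.sort_values("time").set_index("time")

# closed="left" makes each window [t - 70 days, t), which excludes the current row.
windows = indexed.groupby("Distributor")["Label"].rolling("70D", closed="left")

# Assumed column names; each statistic reuses the same leakage-free window.
past_stats = pd.DataFrame({
    "past_mean": windows.mean(),
    "past_std": windows.std(),
})
print(past_stats)
```

This covers a couple of statistics, but every new variable or aggregation still has to be written by hand, which is what I would like to avoid.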

Here is a failed attempt with featuretools:

import pandas as pd
import numpy as np
from datetime import datetime
import featuretools as ft

timestamps = ['2019-01-05-10:36:12', '2019-01-04-11:32:12', '2019-01-03-08:01:03', '2019-01-03-06:32:54',
                '2019-01-01-07:30:24', '2018-12-20-04:20:25']

time = [datetime.strptime(x,'%Y-%m-%d-%H:%M:%S') for x in timestamps]

data = {'time': time,
        'Distributor': ['A','B','A','B','B','B'],
        'Label': [1, 0, 0, 0, 0, 1]}

# Create DataFrame
df = pd.DataFrame(data)
df = df.sort_values(['Distributor','time'])

cutoff_times = pd.DataFrame({
    "index": df.index,
    "cutoff_time": df['time']
    })

es = ft.EntitySet(id='Sales')
es.entity_from_dataframe(entity_id='Sales', dataframe=df, index='index', make_index=True, time_index='time')
es = es.normalize_entity(base_entity_id='Sales', new_entity_id='Distributors', index='Distributor')

feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='Sales',
                                      cutoff_time=cutoff_times,
                                      where_primitives=['mean'], features_only=False,
                                      cutoff_time_in_index=False)

feature_matrix # not correct

Does anyone have a lead on how to achieve this? I can't seem to find anything similar in the documentation, yet this seems to be a pretty common step in machine learning pre-processing.

Paul
  • Can you double check the expected output for `mean(Distributor.label)` and show how you can get those values using pandas? – Jeff Hernandez Apr 16 '20 at 19:53
  • Hi, I have edited the above question for clarity. In a nutshell, I want to calculate primitives (mean, cumsum, std, etc.) grouped by an entity (here a 'Distributor') over a time window that excludes the latest label. The idea is that distributors who had good sales in the past (before point in time T) are a good predictor of a purchase for a new record (at point in time T). – Paul Apr 17 '20 at 08:18

1 Answer

You can use cutoff times in DFS to calculate those values. I will go through an example using the same dataset. For reference, this is the output that I get from running your code in Pandas.

                       Distributor  Label  Past_average_label_per_distributor
Distributor time
A           2019-01-03           A    0.0                                 NaN
            2019-01-05           A    1.0                            0.000000
B           2018-12-20           B    1.0                                 NaN
            2019-01-01           B    0.0                            1.000000
            2019-01-03           B    0.0                            0.500000
            2019-01-04           B    0.0                            0.333333

First, we create the dataset.

import pandas as pd
import numpy as np
import featuretools as ft

data = {
    'ID': [0, 1, 2, 3, 4, 5],
    'Distributor': ['A', 'B', 'A', 'B', 'B', 'B'],
    'Label': [1, 0, 0, 0, 0, 1],
    'Time': [
        '2019-01-05-10:36:12',
        '2019-01-04-11:32:12',
        '2019-01-03-08:01:03',
        '2019-01-03-06:32:54',
        '2019-01-01-07:30:24',
        '2018-12-20-04:20:25',
    ],
}

types = {'Time': 'datetime64[ns]'}
df = pd.DataFrame(data).astype(types)
df = df.sort_values(['Distributor', 'Time'])
print(df.to_string(index=False))
               Time Distributor  Label  ID
2019-01-03 08:01:03           A      0   2
2019-01-05 10:36:12           A      1   0
2018-12-20 04:20:25           B      1   5
2019-01-01 07:30:24           B      0   4
2019-01-03 06:32:54           B      0   3
2019-01-04 11:32:12           B      0   1

Then, we build the entity set.

es = ft.EntitySet()

es.entity_from_dataframe(
    entity_id='Sales',
    dataframe=df,
    time_index='Time',
    index='ID',
)

es.normalize_entity(
    base_entity_id='Sales',
    new_entity_id='Distributors',
    index='Distributor',
    make_time_index=False,
)

es.add_last_time_indexes()

es.plot()

Now, we generate the feature matrix using cutoff times.

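# Take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
cutoff_times = df[['Distributor', 'Time', 'Label']].copy()
# Move each cutoff to midnight of the same day
cutoff_times['Time'] = cutoff_times['Time'].dt.normalize()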

fm, _ = ft.dfs(
    target_entity='Distributors',
    entityset=es,
    trans_primitives=[],
    agg_primitives=['mean', 'std'],
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,
)

print(fm)
                        MEAN(Sales.Label)  STD(Sales.Label)  Label
Distributor time
A           2019-01-03                NaN               NaN      0
            2019-01-05           0.000000               NaN      1
B           2018-12-20                NaN               NaN      1
            2019-01-01           1.000000               NaN      0
            2019-01-03           0.500000          0.707107      0
            2019-01-04           0.333333          0.577350      0
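The `dt.normalize()` call on the cutoff times is what keeps each row's own label out of its features. As far as I understand, DFS only uses data at or before the cutoff time, so moving the cutoff back to midnight places it before the sale's own timestamp. A small pandas-only sketch of that effect, with illustrative variable names:

```python
import pandas as pd

# Two event timestamps taken from the example dataset.
times = pd.Series(pd.to_datetime(["2019-01-03 08:01:03", "2019-01-05 10:36:12"]))

# normalize() truncates each timestamp to midnight of the same day.
cutoffs = times.dt.normalize()

print(pd.DataFrame({"event_time": times, "cutoff": cutoffs}))
```

Each cutoff falls strictly before its event, so the event itself is excluded from the aggregation. Any other sale by the same distributor earlier on the same day would be excluded as well.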

Let me know if this helps. You can also find more information about using cutoff times in this link.

Jeff Hernandez