I am trying to create a custom TransformPrimitive in Featuretools to calculate rolling statistics like the rolling sum or mean.

This article explains well how to go about such a task using Pandas. It shows how to get things running when the 'window' parameter is the number of observations used to calculate the statistic.

However, I want to pass a string as the window so that it is interpreted as an offset in days. Conceptually, the line below calculates exactly what I need.

transactions.groupby('ID').rolling(window='10D', on='TransactionDate')[['Quantity','AmountPaid']].sum()
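For reference, here is a minimal runnable sketch of that call. The sample rows are invented purely for illustration; only the column names come from the line above. It assumes the frame is sorted by date within each ID, since an offset window needs a monotonic `on` column.

```python
import pandas as pd

# Invented sample data; only the column names match the question.
transactions = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "TransactionDate": pd.to_datetime(
        ["2021-01-01", "2021-01-05", "2021-01-20", "2021-01-02", "2021-01-03"]
    ),
    "Quantity": [1, 2, 3, 4, 5],
    "AmountPaid": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# An offset window such as '10D' needs the `on` column to be sorted
# within each group, so sort before rolling.
transactions = transactions.sort_values(["ID", "TransactionDate"])

# For each ID, sum the values of the rows whose date falls in the
# 10 days ending at the current row's date.
rolled = (
    transactions.groupby("ID")
    .rolling(window="10D", on="TransactionDate")[["Quantity", "AmountPaid"]]
    .sum()
)
print(rolled)
```

For ID 1, the row dated 2021-01-20 only sums itself, because the two earlier transactions are more than 10 days old.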

The TransformPrimitive looks as follows:

class RollingSum(TransformPrimitive):
    """Calculates the rolling sum.

    Description:
        Given a list of values, return the rolling sum.
    """

    name = "rolling_sum"
    input_types = [NaturalLanguage,NaturalLanguage]
    return_type = Numeric
    uses_full_entity = True
    description_template = "the rolling sum of {}"

    def __init__(self, window=None, on=None):
        self.window = window
        self.on = on

    def get_function(self):
        def rolling_sum(values):
            """method is passed a pandas series"""
            return values.rolling(window=self.window, on=self.on).sum()

        return rolling_sum

I tried to pass the TransactionDate variable from the entityset:

features_defs = ft.dfs(
    entityset=es,
    max_depth=2,
    target_entity='CUSTOMER',
    agg_primitives=['sum'], 
    groupby_trans_primitives=[
      RollingSum(window='10D', on=es['TRANSACTION']['TransactionDate'])
    ], 
    cutoff_time = label_times,
    cutoff_time_in_index=False,
    include_cutoff_time=False,
    features_only=True
)

This did not work. I am getting an UnusedPrimitiveWarning:

Some specified primitives were not used during DFS: groupby_trans_primitives: ['rolling_sum'] This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data. warnings.warn(warning_msg, UnusedPrimitiveWarning)

Many thanks for your suggestions!

S-UP

1 Answer

You’re on the right track with trying to provide the Datetime Variable, es['TRANSACTION']['TransactionDate'], to the on parameter, but Pandas won’t know what to do with a Featuretools Variable, so this could be a good opportunity to create a new Primitive, RollingSumOnDatetime.

There are a few changes you can make to the RollingSum primitive here so that it can use your datetime column.

  1. input_types should be [Numeric, DatetimeTimeIndex], since the datetime column used for the rolling sum must be present in the data passed to the pd.DataFrame.rolling call. The Numeric variable is there because rolling statistics can only be calculated on numeric columns. The DatetimeTimeIndex variable ensures that the series is a monotonic datetime (since Featuretools sorts time indices), which is another requirement for calculating the rolling sum.
  2. The rolling_sum function should combine the Numeric and DatetimeTimeIndex columns into a single DataFrame and rolling should be calculated from that with the desired window.
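The monotonic requirement in point 1 can be checked with plain pandas. This is only a small sketch: the helper `rolls_cleanly` and the sample dates are made up for illustration, and it assumes pandas signals a non-monotonic `on` column with a ValueError.

```python
import pandas as pd


def rolls_cleanly(dates):
    """Return True if a '10D' rolling sum works on these dates (hypothetical helper)."""
    df = pd.DataFrame({
        "Quantity": [1, 2, 3],
        "TransactionDate": pd.to_datetime(dates),
    })
    try:
        df.rolling(window="10D", on="TransactionDate").sum()
        return True
    except ValueError:
        # pandas refuses offset windows over a non-monotonic `on` column
        return False


sorted_ok = rolls_cleanly(["2021-01-01", "2021-01-05", "2021-01-20"])
unsorted_ok = rolls_cleanly(["2021-01-05", "2021-01-01", "2021-01-20"])
print(sorted_ok, unsorted_ok)
```

Sorted dates roll without complaint, while the shuffled ones are rejected, which is why using the time index as the second input matters.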

I’m imagining that the Primitive looks something like this:

# Imports assume a pre-1.0 Featuretools release that still ships variable_types.
import pandas as pd
from featuretools.primitives.base import TransformPrimitive
from featuretools.variable_types import DatetimeTimeIndex, Numeric


class RollingSumOnDatetime(TransformPrimitive):
    """Calculates the rolling sum on a Datetime time index column.

    Description:
        Given a list of values and a Datetime time index, return the rolling sum.
    """

    name = "rolling_sum_on_datetime"
    input_types = [Numeric, DatetimeTimeIndex]
    return_type = Numeric
    uses_full_entity = True
    description_template = "the rolling sum of {} on {}"

    def __init__(self, window=None):
        self.window = window

    def get_function(self):
        def rolling_sum(to_roll, on_column):
            """method is passed two pandas series"""
            # create a DataFrame that has both columns in it
            df = pd.DataFrame({to_roll.name: to_roll, on_column.name: on_column})
            rolled_df = df.rolling(window=self.window, on=on_column.name).sum()
            return rolled_df[to_roll.name]

        return rolling_sum

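
To sanity-check the inner function without involving Featuretools at all, the same logic can be run on two plain pandas Series. This is a sketch with invented sample values; the standalone function name `rolling_sum_on_datetime` and its `window` default are just for illustration.

```python
import pandas as pd


def rolling_sum_on_datetime(to_roll, on_column, window="10D"):
    """Standalone copy of the primitive's inner logic (illustrative only)."""
    # Put both series side by side so rolling can use the datetime column via `on`.
    df = pd.DataFrame({to_roll.name: to_roll, on_column.name: on_column})
    rolled_df = df.rolling(window=window, on=on_column.name).sum()
    # Keep only the rolled numeric column; the datetime column is just the axis.
    return rolled_df[to_roll.name]


# Invented sample values, already sorted by date.
dates = pd.Series(
    pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-20"]),
    name="TransactionDate",
)
quantity = pd.Series([1, 2, 3], name="Quantity")

result = rolling_sum_on_datetime(quantity, dates)
print(result)
```

If this behaves as expected, the primitive would presumably be passed to dfs the same way as in the question, as RollingSumOnDatetime(window='10D') in groupby_trans_primitives, without the on argument.
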
Tamar Grey