1

Background

For a CRM project, I have snapshots of opportunities. I've been able to build many features w/ featuretools already, but what I really want is to have historical win counts and rate. In other terms, I'd like to know:

For a given opportunity, how many deals have been won up until the opportunity was last modified?

Example Data

import pandas as pd
import featuretools as ft

df = pd.DataFrame(
    {'OpportunityId': [111, 111, 222, 222, 222, 333, 333, 333],
     'UpdateId': [1,3,2,5,7,4,6,8],
     'Label': ['Open', 'Win', 'Open', 'Open', 'Win', 'Open', 'Open', 'Open'],
     'CreatedOn': pd.to_datetime(['9/27/18','9/27/18','9/28/18','9/28/18','9/28/18','10/2/18','10/2/18','10/2/18']),
     'ModifiedOn': pd.to_datetime(['9/27/18','10/1/18','9/28/18','10/3/18','10/7/18','10/2/18','10/6/18','10/10/18']),
     'EstRevenue': [2000, 2000, 80000, 84000, 78000, 100000, 95000, 110000]})
df
| OpportunityId | UpdateId | Label | CreatedOn  | ModifiedOn | EstRevenue |
|---------------|----------|-------|------------|------------|------------|
| 111           | 1        | Open  | 2018-09-27 | 2018-09-27 | 2000       |
| 111           | 3        | Win   | 2018-09-27 | 2018-10-01 | 2000       |
| 222           | 2        | Open  | 2018-09-28 | 2018-09-28 | 80000      |
| 222           | 5        | Open  | 2018-09-28 | 2018-10-03 | 84000      |
| 222           | 7        | Win   | 2018-09-28 | 2018-10-07 | 78000      |
| 333           | 4        | Open  | 2018-10-02 | 2018-10-02 | 100000     |
| 333           | 6        | Open  | 2018-10-02 | 2018-10-06 | 95000      |
| 333           | 8        | Open  | 2018-10-02 | 2018-10-10 | 110000     |

Desired Output

| OPPORTUNITIES | Label | CreatedOn | Max( ModifiedOn ) | AVG( EstRevenue ) | Wins |
|---------------|-------|-----------|-------------------|------------------:|------|
| 111           | Win   | 9/27/18   | 10/1/18           |              2000 | 0    |
| 222           | Win   | 9/28/18   | 10/7/18           |             80667 | 1    |
| 333           | Open  | 10/2/18   | 10/10/18          |            101667 | 2    |

Attempts so far

What is hard for me to wrap my brain around is...

  1. Features that are dependent on multiple opportunities... do I need a separate entity?
  2. How do I aggregate the Label that delivers both:
    1. The current value of the Label, and
    2. The count when the Label column is 0

The challenge for me is the Label column... Whereas normally I would make a CurrentLabel column, I'm pretty sure ft can handle this...

es = (ft.EntitySet(id='CRM')
      .entity_from_dataframe(
          entity_id='updates',
          dataframe=df,
          index='UpdateId',
          time_index='ModifiedOn')
      .normalize_entity(
          base_entity_id='updates',
          new_entity_id='opportunities',
          index='OpportunityId',
          make_time_index='CreatedOn',
          copy_variables=['Label'],
          additional_variables=['CreatedOn']
      )
)
es['updates']['Label'].interesting_values  = ['Win']
Entityset: CRM
  Entities:
    updates [Rows: 8, Columns: 5]
    opportunities [Rows: 3, Columns: 3]
  Relationships:
    updates.OpportunityId -> opportunities.OpportunityId
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="opportunities",
    agg_primitives=[
        "mean","count","num_unique","time_since_first"],
    trans_primitives=[
        'time_since_previous'],
    where_primitives=[
        "sum","count"],
    max_depth=2,
    verbose=1
)
Anders Swanson
  • 3,637
  • 1
  • 18
  • 43
  • what is the logic for aggregated results in columns `'Wins'` and `'Label'`? I understand the other columns, but not these ones. – Valentino Sep 21 '19 at 23:59
  • `Label` is whether the opportunity was 'Won' or 'Lost', which I'm trying to predict. `Wins` column is the answer to the question: how many deals have been won up until the opportunity was last modified? – Anders Swanson Sep 24 '19 at 13:12

1 Answers1

2

There are a couple different ways you could tackle this problem:

  1. Create a new dataframe column and use the last aggregation primitive

For this approach, before you create your entityset, first create a new Wins column in your dataframe that tracks the cumulative total of wins over time. You may need to sort your dataframe by the ModifiedOn column to make sure the cumulative sum value is correct. Also, I am using .shift() here to shift the column values by one position, to only count the wins that occurred before the update:

df = df.sort_values('ModifiedOn')
df['Wins'] = df['Label'].shift().eq('Win').cumsum()

When you run deep feature synthesis, add the last primitive to your list of agg_primitives:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="opportunities",
    agg_primitives=["mean", "count", "num_unique", "time_since_first", "last"],
    trans_primitives=["time_since_previous"],
    where_primitives=["sum", "count"],
    max_depth=2,
    verbose=1
)

Now, when you inspect your feature matrix you will have a column labeled LAST(updates.Label) that displays the value for Label at the time the opportunity was last updated. You will also have a column labeled LAST(updates.Wins) that displays the total wins at the time the opportunity was last updated.

  1. Use a seed feature with a custom primitive This approach is conceptually similar as the first approach, but utilizes a seed feature and a custom primitive to arrive at the desired output.

Before you run deep feature synthesis, create a new boolean seed feature that defines whether the Label is equal to Win:

label_is_win = ft.Feature(es["updates"]["Label"]) == "Win"

Next, define a custom transform primitive that will use this seed feature to return the cumulative sum of wins:

class ShiftedCumSumBoolean(ft.primitives.TransformPrimitive):
    name = "shifted_cum_sum_boolean"
    input_types = [ft.variable_types.Boolean]
    return_type = ft.variable_types.Numeric
    uses_full_entity = True

    def get_function(self):
        def shifted_cum_sum(values):
            return values.shift(fill_value=(False)).cumsum()

        return shifted_cum_sum

When you run deep feature synthesis, add the new ShiftedCumSumBoolean primitive to the list of trans_primitives. Also, add the last primitive to the list of agg_primitives to provide the last label value when the opportunity was updated. Finally, add the label_is_win seed feature to the list of seed features in the ft.dfs call:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="opportunities",
    agg_primitives=["mean","count","num_unique","time_since_first", "last"],
    trans_primitives=["time_since_previous", ShiftedCumSumBoolean],
    where_primitives=["sum", "count"],
    seed_features=[label_is_win],
    max_depth=2,
    verbose=1,
)

With this solution your feature matrix will have a column labeled LAST(updates.Label) that displays the value for Label at the time the opportunity was last updated. You will also have a column labeled LAST(updates.SHIFTED_CUM_SUM_BOOLEAN(Label = Win)) that displays the total wins at the time the opportunity was last updated.

Nate Parsons
  • 376
  • 1
  • 2