Background
For a CRM project, I have snapshots of opportunities. I've been able to build many features w/ featuretools
already, but what I really want is to have historical win counts and rate. In other terms, I'd like to know:
For a given opportunity, how many deals have been won up until the opportunity was last modified?
Example Data
import pandas as pd
import featuretools as ft
df = pd.DataFrame(
{'OpportunityId': [111, 111, 222, 222, 222, 333, 333, 333],
'UpdateId': [1,3,2,5,7,4,6,8],
'Label': ['Open', 'Win', 'Open', 'Open', 'Win', 'Open', 'Open', 'Open'],
'CreatedOn': pd.to_datetime(['9/27/18','9/27/18','9/28/18','9/28/18','9/28/18','10/2/18','10/2/18','10/2/18']),
'ModifiedOn': pd.to_datetime(['9/27/18','10/1/18','9/28/18','10/3/18','10/7/18','10/2/18','10/6/18','10/10/18']),
'EstRevenue': [2000, 2000, 80000, 84000, 78000, 100000, 95000, 110000]})
df
| OpportunityId | UpdateId | Label | CreatedOn | ModifiedOn | EstRevenue |
|---------------|----------|-------|------------|------------|------------|
| 111 | 1 | Open | 2018-09-27 | 2018-09-27 | 2000 |
| 111 | 3 | Win | 2018-09-27 | 2018-10-01 | 2000 |
| 222 | 2 | Open | 2018-09-28 | 2018-09-28 | 80000 |
| 222 | 5 | Open | 2018-09-28 | 2018-10-03 | 84000 |
| 222 | 7 | Win | 2018-09-28 | 2018-10-07 | 78000 |
| 333 | 4 | Open | 2018-10-02 | 2018-10-02 | 100000 |
| 333 | 6 | Open | 2018-10-02 | 2018-10-06 | 95000 |
| 333 | 8 | Open | 2018-10-02 | 2018-10-10 | 110000 |
Desired Output
| OPPORTUNITIES | Label | CreatedOn | Max( ModifiedOn ) | AVG( EstRevenue ) | Wins |
|---------------|-------|-----------|-------------------|------------------:|------|
| 111 | Win | 9/27/18 | 10/1/18 | 2000 | 0 |
| 222 | Win | 9/28/18 | 10/7/18 | 80667 | 1 |
| 333 | Open | 10/2/18 | 10/10/18 | 101667 | 2 |
Attempts so far
What is hard for me to wrap my brain around is...
- Features that are dependent on multiple opportunities... do I need a separate entity?
- How do I aggregate the
Label
that delivers both:- The current value of the
Label
, and - The count when the
Label
column is 0
- The current value of the
The challenge for me is the Label
column... Whereas normally I would make a CurrentLabel
column, I'm pretty sure ft
can handle this...
es = (ft.EntitySet(id='CRM')
.entity_from_dataframe(
entity_id='updates',
dataframe=df,
index='UpdateId',
time_index='ModifiedOn')
.normalize_entity(
base_entity_id='updates',
new_entity_id='opportunities',
index='OpportunityId',
make_time_index='CreatedOn',
copy_variables=['Label'],
additional_variables=['CreatedOn']
)
)
es['updates']['Label'].interesting_values = ['Win']
Entityset: CRM
Entities:
updates [Rows: 8, Columns: 5]
opportunities [Rows: 3, Columns: 3]
Relationships:
updates.OpportunityId -> opportunities.OpportunityId
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_entity="opportunities",
agg_primitives=[
"mean","count","num_unique","time_since_first"],
trans_primitives=[
'time_since_previous'],
where_primitives=[
"sum","count"],
max_depth=2,
verbose=1
)