I'm trying to work through the methodology for churn prediction I found here:
Let's say today is 2017-01-06 (January 6, 2017). I have a pandas dataframe, df, that I want to add two columns to.
import pandas as pd

df = pd.DataFrame([
['a', '2017-01-01', 0],
['a', '2017-01-02', 0],
['a', '2017-01-03', 0],
['a', '2017-01-04', 1],
['a', '2017-01-05', 1],
['b', '2017-01-01', 0],
['b', '2017-01-02', 1],
['b', '2017-01-03', 0],
['b', '2017-01-04', 0],
['b', '2017-01-05', 0]
], columns=['id', 'date', 'is_event'])
df['date'] = pd.to_datetime(df['date'])
One is time_to_next_event and the other is is_censored. Within each id, time_to_next_event decreases towards zero as the next event gets closer in time. If no further event occurs before today, time_to_next_event instead counts down the days remaining until today, through the end of the group.
is_censored is a binary flag related to this phenomenon: within each id, it marks the rows that fall between the most recent event and today. For id a, the most recent row contains an event, so is_censored is zero for the whole group. For id b, there are three rows between the most recent event and today, so is_censored is 1 for each of them.
desired = pd.DataFrame([
['a', '2017-01-01', 0, 3, 0],
['a', '2017-01-02', 0, 2, 0],
['a', '2017-01-03', 0, 1, 0],
['a', '2017-01-04', 1, 0, 0],
['a', '2017-01-05', 1, 0, 0],
['b', '2017-01-01', 0, 1, 0],
['b', '2017-01-02', 1, 0, 0],
['b', '2017-01-03', 0, 3, 1],
['b', '2017-01-04', 0, 2, 1],
['b', '2017-01-05', 0, 1, 1]
], columns=['id', 'date', 'is_event', 'time_to_next_event', 'is_censored'])
desired['date'] = pd.to_datetime(desired['date'])
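For reference, here is a minimal sketch of one way both columns might be derived, not a definitive answer. It assumes time is measured in whole days, that `today` is a variable I introduce here (it is not in the frame), and that "next event" includes an event on the row's own date. The idea is to keep the date only on event rows, back-fill it within each id, and treat rows left without a following event as censored, counting down to `today` instead.

```python
import pandas as pd

df = pd.DataFrame([
    ['a', '2017-01-01', 0], ['a', '2017-01-02', 0], ['a', '2017-01-03', 0],
    ['a', '2017-01-04', 1], ['a', '2017-01-05', 1],
    ['b', '2017-01-01', 0], ['b', '2017-01-02', 1], ['b', '2017-01-03', 0],
    ['b', '2017-01-04', 0], ['b', '2017-01-05', 0],
], columns=['id', 'date', 'is_event'])
df['date'] = pd.to_datetime(df['date'])

# Assumption: "today" is supplied by hand, as in the question.
today = pd.Timestamp('2017-01-06')

# Back-filling relies on chronological order within each id.
df = df.sort_values(['id', 'date']).reset_index(drop=True)

# Keep the date on event rows only (NaT elsewhere), then back-fill per id
# so that every row carries the date of its next event, if there is one.
event_date = df['date'].where(df['is_event'].eq(1))
next_event = event_date.groupby(df['id']).bfill()

# Rows with a following event count down to it; the rest count down to today.
df['time_to_next_event'] = (next_event.fillna(today) - df['date']).dt.days

# Rows with no following event inside their id are the censored ones.
df['is_censored'] = next_event.isna().astype(int)

print(df)
```

On the sample data this should give the same two columns as `desired`. If the data has gaps in the dates, the day counts still come from date differences rather than row positions, which may or may not be what the methodology wants.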
For time_to_next_event, I found this SO question but had trouble getting it to fit my use case.
For is_censored, I'm stumped so far. I'm posting this question in the hopes that some benevolent Stack Overflower will take pity on me while I sleep (working in EU) and I'll take another stab at this tomorrow. Will update with anything I find. Many thanks in advance!