
I'm trying to work through the methodology for churn prediction I found here:

Let's say today is 2017-01-06. I have a pandas dataframe, df, that I want to add two columns to.

import pandas as pd

df = pd.DataFrame([
    ['a', '2017-01-01', 0],
    ['a', '2017-01-02', 0],
    ['a', '2017-01-03', 0],
    ['a', '2017-01-04', 1],
    ['a', '2017-01-05', 1],
    ['b', '2017-01-01', 0],
    ['b', '2017-01-02', 1],
    ['b', '2017-01-03', 0],
    ['b', '2017-01-04', 0],
    ['b', '2017-01-05', 0]
    ]
    ,columns=['id','date','is_event']
)
df['date'] = pd.to_datetime(df['date'])

One is time_to_next_event and the other is is_censored. Within each id, time_to_next_event decreases towards zero as the next event gets closer in time. If no further event occurs before today, time_to_next_event keeps decreasing until the end of the group, counting down towards today instead.

is_censored is a binary flag related to this phenomenon and indicates, within each id, the rows that fall between the most recent event and today. For id a, the most recent row contains an event, so is_censored is zero for the whole group. For id b, there are three rows between the most recent event and today, so each of those rows gets is_censored = 1.

desired = pd.DataFrame([
    ['a', '2017-01-01', 0, 3, 0],
    ['a', '2017-01-02', 0, 2, 0],
    ['a', '2017-01-03', 0, 1, 0],
    ['a', '2017-01-04', 1, 0, 0],
    ['a', '2017-01-05', 1, 0, 0],
    ['b', '2017-01-01', 0, 1, 0],
    ['b', '2017-01-02', 1, 0, 0],
    ['b', '2017-01-03', 0, 3, 1],
    ['b', '2017-01-04', 0, 2, 1],
    ['b', '2017-01-05', 0, 1, 1]
    ]
    ,columns=['id','date','is_event','time_to_next_event', 'is_censored']
)
desired['date'] = pd.to_datetime(desired['date'])
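
To double-check my own definition, something like this should hold (untested sketch): within each id, is_censored is 1 exactly on rows dated after that id's last event.

# Sketch: is_censored should flag rows strictly after each id's most recent event
last_event = desired[desired['is_event'] == 1].groupby('id')['date'].max()
expected = (desired['date'] > desired['id'].map(last_event)).astype(int)
assert (expected == desired['is_censored']).all()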

For time_to_next_event, I found this SO question but had trouble getting it to fit my use case.

For is_censored, I'm stumped so far. I'm posting this question in the hope that some benevolent Stack Overflower will take pity on me while I sleep (I'm working in the EU), and I'll take another stab at this tomorrow. Will update with anything I find. Many thanks in advance!

drew_is_good
  • Could you explain the definition of `is_censored` some more? Having trouble reconciling your definition with the contents of that column in the `desired` df. Do you mean that that column should contain the index of the row of the last seen event within that id group? – LateCoder Jun 20 '17 at 20:33
  • Also, just to confirm, are you missing a row for id 'b' that contains the event, given that on 2017-01-05 you're 1 day away from an event? – LateCoder Jun 20 '17 at 20:35
  • Hi LateCoder, thanks for that call-out. I've elaborated a bit, does it make sense now? – drew_is_good Jun 20 '17 at 20:57
  • The `time_to_next_event` makes sense. Still not sure about `is_censored` - maybe you could give a concrete example of why 2017-01-03, 2017-01-04 and 2017-01-05 all have a value of 1? Based on the wording, it sounds like `is_censored` is supposed to identify all rows that occurred between the most recent event and today, so I would think that there should be a list of row ids in each cell, or something to that effect. – LateCoder Jun 20 '17 at 21:03
  • One more edit, how does that look? – drew_is_good Jun 20 '17 at 21:35
  • Yes, that makes sense. Thanks! – LateCoder Jun 20 '17 at 22:56

2 Answers


To get the days until the next event, we can add a column that backfills the date of the next event:

# Date of the event on event rows, NaT everywhere else
df['next_event'] = df['date'][df['is_event'] == 1]
# Within each id, backfill so each row carries the date of its next event
df['next_event'] = df.groupby('id')['next_event'].transform(lambda x: x.fillna(method='bfill'))
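
For intuition (a rough sketch), for id 'b' the backfilled next_event is 2017-01-02 for the first two rows and NaT afterwards, since no later event exists in that group:

# Sketch: inspect the backfilled helper column for id 'b'
print(df.loc[df['id'] == 'b', ['date', 'is_event', 'next_event']])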

Rows with no upcoming event are filled with "today" (the day after the last date in the data), and we can then just subtract to get the days between the next event and each day:

df['next_event'] = df['next_event'].fillna(df['date'].iloc[-1] + pd.Timedelta(days=1))
df['time_to_next_event'] = (df['next_event']-df['date']).dt.days
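
As a quick check (a sketch, assuming the desired frame from the question is in scope), this should now match the expected column:

# Sketch: the computed values should equal the question's expected output
assert (df['time_to_next_event'].values == desired['time_to_next_event'].values).all()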

To get the is_censored value for each day and each id, we can group by id and forward-fill the 'is_event' column within each group, replacing zeros with the last seen event flag. We can't use the forward-filled values directly, because according to the definition above is_censored should be 0 on the day of the event itself. So we compare the 'is_event' column to its forward-filled version and set is_censored to 1 wherever the forward-fill introduced a value that wasn't in the original.

# 1 wherever the forward-fill turned an original 0 into a 1, i.e. rows after the last event
df['is_censored'] = (df.groupby('id')['is_event'].transform(lambda x: x.replace(0, method='ffill')) != df['is_event']).astype(int)
df = df.drop('next_event', axis=1)
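
For intuition, here is roughly what that forward-filled helper looks like for id 'b' (just a sketch; ffilled is a throwaway name):

# Sketch: for id 'b' the forward-filled flags are [0, 1, 1, 1, 1] versus the
# original is_event of [0, 1, 0, 0, 0]; the three mismatches are the censored rows
ffilled = df.groupby('id')['is_event'].transform(lambda x: x.replace(0, method='ffill'))
print(ffilled[df['id'] == 'b'].tolist())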

In [343]: df
Out[343]:
  id       date  is_event  time_to_next_event  is_censored
0  a 2017-01-01         0                   3            0
1  a 2017-01-02         0                   2            0
2  a 2017-01-03         0                   1            0
3  a 2017-01-04         1                   0            0
4  a 2017-01-05         1                   0            0
5  b 2017-01-01         0                   1            0
6  b 2017-01-02         1                   0            0
7  b 2017-01-03         0                   3            1
8  b 2017-01-04         0                   2            1
9  b 2017-01-05         0                   1            1
LateCoder

To generalize the method for is_censored to include cases where an event happens more than once within each id, I wrote this:

df['is_censored2'] = 1

# Latest event date within each id
max_dates = df[df['is_event'] == 1].groupby('id',as_index=False)['date'].max()
max_dates.columns = ['id','max_date']
df = pd.merge(df,max_dates,on=['id'],how='left')

# Rows on or before that last event are not censored (.loc avoids chained assignment)
df.loc[df['date'] <= df['max_date'], 'is_censored2'] = 0

It initializes the column to 1, grabs the latest event date within each id, and sets is_censored2 to 0 on every row whose date is on or before that id's last event.
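
For example (a sketch on made-up data where id 'b' has two events), only the rows after the latest event end up censored:

import pandas as pd

# Hypothetical data: id 'b' has events on 2017-01-02 and 2017-01-04
df2 = pd.DataFrame([
    ['b', '2017-01-01', 0],
    ['b', '2017-01-02', 1],
    ['b', '2017-01-03', 0],
    ['b', '2017-01-04', 1],
    ['b', '2017-01-05', 0],
], columns=['id', 'date', 'is_event'])
df2['date'] = pd.to_datetime(df2['date'])

df2['is_censored2'] = 1
max_dates = df2[df2['is_event'] == 1].groupby('id', as_index=False)['date'].max()
max_dates.columns = ['id', 'max_date']
df2 = pd.merge(df2, max_dates, on=['id'], how='left')
df2.loc[df2['date'] <= df2['max_date'], 'is_censored2'] = 0
# Only the 2017-01-05 row (after the last event on 2017-01-04) keeps is_censored2 == 1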

drew_is_good