
I'm new to survival analysis and quickly coded a baseline survival function with Python's lifelines library. But the curves I get are too optimistic compared to the retention curves I plotted myself from my data.

Below is a graph I made from my data. It shows how many weeks clients from different cohorts stay. It is clear that customers from Belgium (red line) stay longer than customers from the Netherlands. After 20 weeks, 56% of Belgian customers are still there, compared with only 43% of Dutch customers.

[Figure: Retention curve]

But when I use the lifelines CoxPHFitter to plot the survival function, I get the graphs below:

[Figure: survival curves plotted by CoxPHFitter]

It shows that the probability of "being alive" after 20 weeks is more than 70% for Belgian customers and more than 50% for Dutch customers.

Why are these numbers different? Did I misinterpret one of the curves?

This is my code:

from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(dumies, duration_col='weeks_subscribed', event_col='stopped')
cph.plot_partial_effects_on_outcome('addresses__address__country__name_Nederland', values=[0, 1])

Where 'stopped' is set to 1 if the customer isn't subscribed anymore. The average length of 'weeks_subscribed' is 18.

EDIT:

The way I calculated the retention graph manually is as follows:

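from datetime import datetime

import numpy as np
import pytz


def add_time_subscribed(rd):
    # Adds a 'weeks_subscribed' column to rd in place.
    rd['weeks_subscribed'] = 0.0
    for index, row in rd.iterrows():
        if (not row['stopped']) and (not row['_paused']):
            # Still active: measure up to now (the true end date is not yet known).
            end_date = datetime.now(tz=pytz.UTC)
        else:
            end_date = row['paused_at']
        rd.loc[index, 'weeks_subscribed'] = (end_date - row['subscribed_at']).days / 7


def stayers_per_week(rd):
    # y_axis[i] counts the customers whose recorded duration reaches week i.
    y_axis = np.zeros(int(rd['weeks_subscribed'].max()) + 1)
    for index, row in rd.iterrows():
        for i in range(int(row['weeks_subscribed']) + 1):
            y_axis[i] += 1
    x_axis = list(range(len(y_axis)))
    # Normalise by the week-0 count so the curve starts at 1.
    return x_axis, y_axis / y_axis[0]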
Ruben_G
  • Can you provide a formula (i.e. not code) of how your retention curve is defined? Is censoring present in your dataset? How does your retention curve handle censoring? – Cam.Davidson.Pilon Jul 28 '21 at 20:36
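
For reference, here is one way to write down the two quantities being compared. The first formula is read off the `stayers_per_week` code in the question's edit; the second is the standard Kaplan-Meier estimator. The symbols $n$, $T_i$, $t_j$, $d_j$ and $n_j$ are notation introduced here and do not appear in the post.

```latex
% Manual retention curve: share of all n customers whose recorded
% duration T_i (weeks_subscribed) reaches week w.
R(w) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\lfloor T_i \rfloor \ge w\}

% Kaplan-Meier estimator: d_j customers stop at time t_j,
% n_j customers are still under observation just before t_j.
\hat{S}(t) = \prod_{j \,:\, t_j \le t} \left( 1 - \frac{d_j}{n_j} \right)
```

In $R(w)$ the denominator stays at $n$ for every week, whereas in $\hat{S}(t)$ the risk set $n_j$ shrinks whenever a customer's observation ends, whether or not they stopped.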

1 Answer


The two graphs can differ for several reasons.

I suggest first plotting the Kaplan-Meier estimate of the survival function:

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
# The duration column is 'weeks_subscribed', as in the question's CoxPHFitter call.
kmf.fit(dummies['weeks_subscribed'], event_observed=dummies['stopped'])
kmf.plot_survival_function()

Then you could also plot one curve for each value (0 and 1) of 'addresses__address__country__name_Nederland':

kmf = KaplanMeierFitter()
country_col = 'addresses__address__country__name_Nederland'
for value in [0, 1]:
    # Fit one Kaplan-Meier curve per group and draw both on the same axes.
    subset = dummies[dummies[country_col] == value]
    kmf.fit(subset['weeks_subscribed'], event_observed=subset['stopped'], label=f'Nederland={value}')
    kmf.plot_survival_function()

This might give you better insight into why the curves disagree.
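
One possible cause, in line with the comment about censoring under the question, is that a curve built by counting everyone whose recorded duration reaches week w treats customers who are still subscribed as if they left at their current tenure. The sketch below uses made-up synthetic data (not the asker's) and standard-library Python only; `naive_retention` and `kaplan_meier` are hypothetical helper names, and the hand-rolled Kaplan-Meier loop stands in for what `KaplanMeierFitter` computes. Both curves here are defined as the share surviving to the start of week w, so they coincide when nobody is censored.

```python
import random


def naive_retention(durations, n_weeks):
    """Share of ALL customers whose recorded duration reaches week w.

    A still-subscribed (censored) customer drops out of the count
    once w exceeds their current tenure, as if they had stopped.
    """
    n = len(durations)
    return [sum(int(d) >= w for d in durations) / n for w in range(n_weeks + 1)]


def kaplan_meier(durations, stopped, n_weeks):
    """Kaplan-Meier estimate of surviving to the start of week w.

    Censored customers leave the risk set without counting as a stop.
    """
    weeks = [int(d) for d in durations]
    surv, curve = 1.0, []
    for w in range(n_weeks + 1):
        curve.append(surv)  # survival up to the start of week w
        at_risk = sum(wk >= w for wk in weeks)
        events = sum(wk == w and s for wk, s in zip(weeks, stopped))
        if at_risk:
            surv *= 1 - events / at_risk
    return curve


# Synthetic cohort with staggered sign-ups (all numbers are invented).
rng = random.Random(0)
durations, stopped = [], []
for _ in range(2000):
    lifetime = rng.expovariate(1 / 30)  # true weeks until the customer stops
    tenure = rng.uniform(0, 52)         # weeks since sign-up at analysis time
    durations.append(min(lifetime, tenure))
    stopped.append(lifetime <= tenure)  # False = still subscribed (censored)

naive = naive_retention(durations, 40)
km = kaplan_meier(durations, stopped, 40)
print(f"week 20: naive={naive[20]:.2f}, Kaplan-Meier={km[20]:.2f}")
```

If the manual curve behaves like `naive_retention` here, it will sit below the Kaplan-Meier and CoxPH curves whenever recently subscribed customers are still active.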

Adrien
  • What, then, are all the reasons the two graphs might differ? Shouldn't they show the same thing? By the way, the graphs you proposed are almost exactly the same as the previous graphs. – Ruben_G Jul 27 '21 at 13:51
  • Are the graphs that I suggested the same as your own, or similar to the graph displayed by the CoxPH method? – Adrien Jul 27 '21 at 13:57
  • Yes, I just plotted them and they are almost the same as the CoxPH method. But why are those different from my non-lifeline plot? – Ruben_G Jul 27 '21 at 14:05
  • Are you sure that the scale is the same in both graphs? Is it always in weeks? If so, I think the error comes from the way you plotted the survival function yourself. Could you show us how you proceeded? – Adrien Jul 27 '21 at 14:37
  • Sure, I added my code for the manual calculation in the edit! – Ruben_G Jul 27 '21 at 14:55