I am very new to machine learning and R, so my question might be unclear or need more information. I have tried to explain as much as possible. Please correct me if I have used the wrong terminology. Any help on this will be greatly appreciated.

Context - I am trying to build a model to predict "when" an event is going to happen.

I have a dataset with the structure below. This is not the actual data; it is dummy data created to explain the scenario. The actual data cannot be shared due to confidentiality.

[Image: table of the dummy subscription data]

About data -

  • A customer buys a subscription under which he is allowed to use $x worth of the service provided.
  • A customer can have multiple subscriptions. Subscriptions can overlap in time or follow one another.
  • Each subscription has a limit on the usage, which is $x.
  • Each subscription has a start date and an end date.
  • A subscription is no longer used after its end date.
  • Each customer has his own behavior/pattern in which he uses the service. This is described by other derived variables: monthly utilization, average monthly utilization, etc.
  • A customer can use the service above $x. This is indicated by the column "ExceedanceMonth" in the table above. A value of 1 means the customer went above $x in the first month of the subscription; a value of 5 means the customer went above $x in the 5th month of the subscription. A value of NULL indicates that the limit $x has not been reached yet. This could be either because

     the subscription ended and the customer didn't overuse
      or
     the subscription is yet to end and the customer might overuse in future

  • The 2nd scenario (the one after the "or" above) is what I want to predict. Among the subscriptions which are yet to end and where the customer hasn't overused, WHEN will the limit be reached? i.e. predict the ExceedanceMonth column in the above table.
  • Before reaching this model, I built a classification model using a decision tree which predicts whether the customer is going to cross the limit amount or not, i.e. whether LimitReached = 1 or 0 in the next 2 months. I am not sure whether I should train the model discussed here (predicting time to event) on all the data and then test/use it on the customers/subscriptions with LimitReached = 1, or train it only on the customers/subscriptions which have LimitReached = 1.

I have researched survival models. I understand that a survival model like Cox can be used to estimate the hazard function and to understand how each variable affects the time to event. I tried to use the predict function with Cox, but I did not understand whether any of the values passed to the "type" parameter can be used to predict the actual time, i.e. I did not understand how I can predict the actual value for "WHEN" the limit will be crossed.

Maybe a survival model isn't the right approach for this scenario. So, please advise me on the best way to approach this problem.

library(survival)

#define survival object
recsurv <- Surv(time=df$ExceedanceMonth, event=df$LimitReached)

#only for testing the code
train = subset(df, SubStartDate>="20150301" & SubEndDate<="20180401")
test = subset(df, SubStartDate>"20180401") #only for testing the code

#use bare column names so the formula is evaluated in train (and later in test) rather than in the full df
fit <- coxph(Surv(ExceedanceMonth, LimitReached) ~ SubDurationInMonths+`#subs`+LimitAmount+Monthlyutitlization+AvgMonthlyUtilization, data = train, model = TRUE)
predicted <- predict(fit, newdata = test)
head(predicted)

 1           2           3           4           5           6 
 0.75347328  0.23516619 -0.05535162 -0.03759123 -0.65658488 -0.54233043

Thank you in advance!

Rashmi Shivanna

1 Answer

Survival models are fine for what you're trying to do. (I'm assuming you've estimated the model correctly from this point on.)

The key is understanding what comes out of the model. For a Cox model, the default quantity out of predict() is the linear predictor (b1x1 + b2x2 + ...; the Cox model doesn't estimate an intercept b0). That alone won't tell you anything about when.
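To make the default output concrete, here is a minimal sketch. It uses the lung dataset that ships with the survival package and two made-up covariate profiles, both of which are stand-ins chosen for illustration and not the asker's data. It shows that type = "lp" (the default) and type = "risk" are relative scores on the hazard scale, not times.

```r
library(survival)

# Illustrative model on the built-in lung data (an assumption, not the asker's data)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Two hypothetical covariate profiles to score
nd <- data.frame(age = c(50, 70), sex = c(1, 2))

# Linear predictor (the default type); centered, and has no time unit
lp <- predict(fit, newdata = nd, type = "lp")

# Relative risk score, which is exp(lp)
risk <- predict(fit, newdata = nd, type = "risk")

print(lp)
print(risk)
```

A larger value means a higher hazard relative to the reference profile, so these numbers rank subscriptions by how soon the event tends to occur but do not say when it occurs.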

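Specifying type="expected" for predict() gets you closer to when. It returns the expected number of events given the covariates and the follow-up time (how long you watch the customer), with the follow-up time set equal to the customer's actual duration (retrieved from the coxph model object); exp(-expected) is then the customer's survival probability at that time. It is not itself a duration: to get how long, on average, until the customer reaches his/her limit, you read a time (for example the median) off the predicted survival curve.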
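One way to get an actual time out of a fitted Cox model is to build the predicted survival curve for each covariate profile with survfit() and read off the median, i.e. the time at which the predicted survival probability first drops to 0.5. The sketch below again assumes the built-in lung data and two hypothetical profiles, so the variable names are illustrative only.

```r
library(survival)

# Illustrative model on the built-in lung data (an assumption, not the asker's data)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Two hypothetical covariate profiles
nd <- data.frame(age = c(50, 70), sex = c(1, 2))

# One predicted survival curve per row of nd
sf <- survfit(fit, newdata = nd)

# Median predicted time to event for each profile, in the time unit of the data;
# NA would mean the curve never drops to 0.5 within the observed follow-up
med <- as.numeric(quantile(sf, probs = 0.5, conf.int = FALSE))

print(med)
```

For the subscription data the time unit would be months, and a profile whose curve never reaches 0.5 before the observed follow-up ends has no estimable median.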

The coxed package will also give you expected durations, calculated using a different method, without the need to worry about follow-up time. It's also a little more forgiving when it comes to inputting a newdata argument, particularly if you have a specific covariate profile in mind. See the package vignette here.

See also this thread for more on predict.coxph().

SMzgr356