I am new to Stack Overflow! Sorry in advance if this is a stupid or confusing question.
I have a set of right-censored longitudinal data (i.e. survival data) containing each worker's failure (resignation) time, work location and monthly salary. My goal is to predict/simulate the failure time of each worker. Since the hazard rate approximates the conditional probability of failure over a small time interval, I decided to simulate the failure times based on a Cox proportional hazards model. Here are my steps:
- I split the original dataset into training and testing sets. The training set was used to fit the Cox proportional hazards model.
- Based on the estimated coefficients, I estimated the cumulative baseline hazard function, from which the baseline hazard function can be obtained by differencing.
- I computed the individual hazard rate for each worker and time period (based on the testing set) and stored all the rates in a matrix (rows = workers, columns = simulated days); see the first sketch after this list.
- I chose two ways to simulate/predict the failure time of each worker:
4.1
I used the uniform distribution to generate a random probability for each simulated day, and the failure day is the first simulated day on which the worker's hazard rate exceeds the generated probability. I repeated this step for n iterations. However, the result contains a large number of NaNs because some workers never have a hazard rate greater than the generated probability, so it is difficult to decide their failure time. (See the second sketch below.)
4.2
I simply treated the simulated day with the greatest individual hazard rate as the failure time of each worker. However, this doesn't work well because workers may have nearly constant, small individual hazard rates.
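For concreteness, here is roughly what steps 1-3 look like in my setup (a sketch only, assuming Python with lifelines; the file name and the column names `tenure_days`, `resigned`, `salary`, `location` are placeholders, not my actual data):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.model_selection import train_test_split

df = pd.read_csv("workers.csv")                                  # placeholder file/columns
df = pd.get_dummies(df, columns=["location"], drop_first=True)   # encode the categorical covariate
train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)

# Steps 1-2: fit the Cox PH model on the training set
cph = CoxPHFitter()
cph.fit(train_df, duration_col="tenure_days", event_col="resigned")

# Cumulative baseline hazard H0(t); the discrete baseline hazard h0(t)
# is its first difference over the observed event times
H0 = cph.baseline_cumulative_hazard_.iloc[:, 0]
times = H0.index.values
h0 = np.diff(H0.values, prepend=0.0)

# Step 3: individual hazards h_i(t) = h0(t) * exp(beta' x_i);
# predict_partial_hazard returns exp(beta' x_i) for each test row
risk = np.asarray(cph.predict_partial_hazard(test_df)).ravel()
hazard_matrix = np.outer(risk, h0)    # rows = workers, columns = simulated days
```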
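And a sketch of the two simulation approaches (4.1 and 4.2), continuing from the variables above (variable names are mine):

```python
# hazard_matrix: shape (n_workers, n_days), from the previous snippet
rng = np.random.default_rng(0)
n_workers, n_days = hazard_matrix.shape

# 4.1: draw a uniform number per worker and day; the failure day is the first
# day whose hazard exceeds the draw. Workers whose hazards never exceed any
# draw come out as NaN, which is where the many NaNs come from.
u = rng.uniform(size=(n_workers, n_days))
exceeds = hazard_matrix > u
first_idx = exceeds.argmax(axis=1)               # argmax is 0 even when no day qualifies
failure_41 = np.where(exceeds.any(axis=1), times[first_idx], np.nan)

# 4.2: take the day with the largest individual hazard. Because every worker's
# hazard is the same baseline curve scaled by exp(beta' x_i), the argmax falls
# on the same day for everyone, so this barely discriminates between workers.
failure_42 = times[hazard_matrix.argmax(axis=1)]
```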
I have also tried parametric models, but the computation takes very long because my dataset is large (>800,000 rows).
My question: does anyone have suggestions for simulating/predicting the failure time of each worker?
Thank you very much!
Jeff