
I have some questions regarding the use of coxph() and predict(Surv()). I am aware that my question is a bit long and that I may not have explained myself well enough, but any comment or suggestion is appreciated.

I am trying to make a Cox PH model and predictions for roof repair on houses. I have 5 input variables (covariates):

House_Age (also called Start), House_Price, Roof_Material_Grp_New, Land_Ownership_Status_Grp, Living_Status_Grp

As the names suggest, the first two are numeric variables and the last three are categorical. My problem is that I want the House_Age hazard to depend on time. I chose to do 'data splitting' for every third year of House_Age (so House_Age turns into the variable Start). For instance, in the case of an event happening after 7 years the data would look like

Start  Stop  Event_01_Ts
    0     3            0  (censored)
    3     3            0  (censored)
    6     1            1  (event)
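For reference, something like the following should reproduce the split above with survSplit() from the survival package. This is only a sketch, not the code I actually used: the column names Abt_Roof, House_Age and Event_01 are made up for illustration.

```r
## Sketch of the 3-year data splitting, assuming a one-row-per-house data
## frame with columns House_Age (follow-up time) and Event_01 (0/1 status).
library(survival)

Abt_Roof <- data.frame(House_Age = 7, Event_01 = 1)

## Cut follow-up at every third year; "Start" and the overwritten House_Age
## become the counting-process (Start, Stop] interval for each row.
Abt_Roof_Ts <- survSplit(Surv(House_Age, Event_01) ~ .,
                         data  = Abt_Roof,
                         cut   = seq(3, 39, by = 3),
                         start = "Start")

## survSplit() keeps calendar time in House_Age; the duration-style Stop
## used in the table above is simply the interval length:
Abt_Roof_Ts$Stop <- Abt_Roof_Ts$House_Age - Abt_Roof_Ts$Start
Abt_Roof_Ts[, c("Start", "Stop", "Event_01")]
##   Start Stop Event_01
## 1     0    3        0
## 2     3    3        0
## 3     6    1        1
```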

Start is equal to House_Age. Looking at the estimates for each value of the Start group, there seemed to be a linear dependence up to about 40 years, so I chose a maximum Start/House_Age of 40 and a linear relation:

Cox_Mod_Lin <- coxph(Surv(Stop, Event_01_Ts) ~ Start + Roof_Material_Grp_New +
                       House_Contract_Yen + Land_Ownership_Status_Grp +
                       Living_Status_Grp,
                     data = Abt_Roof_Ts_Mdl)

The model comes out fine, with a linear coefficient on the Start variable of 0.1916 and an exponentiated value of 1.211:

          coef      exp(coef)  se(coef)   z       Pr(>|z|)
Start     1.916e-01 1.211e+00  6.817e-03  28.112  < 2e-16 ***

Looking at Start/House_Age in isolation, the hazard increases by 21% each year (exp(0.1916) ≈ 1.211) - is that correct? My problem is that I now want to make predictions of 'repair' probabilities within 1, 5 and 10 years, for instance. First I try to find the baseline hazard function by using survfit and a zero-vector as input:

Base <- survfit(Cox_Mod_Lin, newdata = Abt_Baseline, type = "aalen")
Base_Time_Hz <- data.frame(Time = Base$time, Cumhaz = Base$cumhaz)
Base_Time_Hz_1yr <- Base_Time_Hz[which(Base_Time_Hz$Time == 1), ]

Here Abt_Baseline contains zeroes for the numerical variables and reference-level values for the groups. From this I find the cumulative hazard at time = 1, 5 and 10 (only 1 year is shown) and multiply it by the exponential of the "lp" prediction found using the predict function.

One year prediction:

Pred_01 <- Base_Time_Hz_1yr$Cumhaz * exp(predict(Cox_Mod_Lin, Abt_Roof_Score, type = "lp"))
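In case it helps, here is a self-contained sketch of the same H0(t) * exp(lp) calculation on survival's built-in lung data (standing in for my roof data, since I cannot share it). Note that Pred_01 above is a cumulative hazard; 1 - exp(-H) turns it into a probability. The covariate values are made up for illustration.

```r
## Self-contained sketch on survival's built-in lung data: cumulative
## baseline hazard at time t, scaled by exp(linear predictor), then
## converted to an event probability.
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

## Baseline cumulative hazard H0(t) at covariate values 0 (centered = FALSE)
H0   <- basehaz(fit, centered = FALSE)
t1   <- 365                              # "1 year" on this data set
H0_1 <- max(H0$hazard[H0$time <= t1])    # H0 at the last event time <= t1

## Cumulative hazard and event probability for new observations.
## reference = "zero" (recent survival versions) makes the lp uncentered,
## matching centered = FALSE above.
new  <- data.frame(age = c(50, 70), sex = c(1, 2))
H    <- H0_1 * exp(predict(fit, newdata = new, type = "lp", reference = "zero"))
P_ev <- 1 - exp(-H)                      # P(event within 1 year)
```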

This would be OK if there were no time-dependent inputs, but the hazard varies in the future as a function of the Start (= House_Age) variable. I know the future values of Start (it increases by 1 each year), so I suppose I could somehow integrate over the prediction period. So I have two main questions:

  • Does this seem to be a sensible way for doing the modeling and (some of) the prediction?
  • If yes - how do I do the integration over the prediction period with respect to the varying (increasing) House_Age/Start hazard?
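To make the second question concrete, this is the kind of piecewise calculation I have in mind, again sketched on the lung stand-in data (age plays the role of Start/House_Age; the starting values are made up, and I do not know whether this is statistically valid):

```r
## For each future year k, take the baseline cumulative-hazard increment
## over (k-1, k] and scale it by exp(lp) evaluated with age advanced by
## k - 1 years, then sum the pieces.
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
H0  <- basehaz(fit, centered = FALSE)

## H0 at the end of each "year" (365-day blocks), with H0(0) = 0
yr_end <- function(t) if (any(H0$time <= t)) max(H0$hazard[H0$time <= t]) else 0
years  <- 1:2                                  # lung follow-up is short
H0_cum <- vapply(years * 365, yr_end, numeric(1))
dH0    <- diff(c(0, H0_cum))                   # H0 increment per year

## Linear predictor with age increased by 1 in each successive year
## (reference = "zero" matches centered = FALSE above)
new_lp <- function(k) predict(fit,
                              newdata = data.frame(age = 60 + (k - 1), sex = 1),
                              type = "lp", reference = "zero")
H      <- sum(dH0 * exp(vapply(years, new_lp, numeric(1))))
P_per  <- 1 - exp(-H)                          # P(event within the period)
```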

Can anybody help me?

Peter S
  • The modelling part seems sensible to me, but not the prediction part. I don't think you need the baseline hazard. I'd suggest you take a look at the `predictSurvProb` function in the `pec` package. This function makes predictions for new data based on your coxph model for a given point in time (e.g. 1, 5, 10 years). – StatMan Jul 19 '16 at 09:49
  • Also take a look at for how to adjust your data to be able to make predictions in the case of time varying covariates. [You probably won't use the timeSplitter function he is talking about, but the logic is the same] – StatMan Jul 19 '16 at 10:00
  • Thanks a lot MarcelG. I was not too sure of the prediction part myself. I will have a look at your suggestions. – Peter S Jul 19 '16 at 23:51
  • I have one more question to you @MarcelG. Thanks to the predictSurvProb function I was able to predict the survival probability one year ahead. So what I did was time splitting the data that I wanted to predict so I had - for each house - one line for each year in the future (10 years ahead) with corresponding values of House_Age (1 year older each year) and the rest of the input variables constant. I found the one-year-ahead-probability for each year (by predictSurvProb) and in order to find the probability after 10 years I multiplied the probabilities for each year. Does this sound sensible? – Peter S Jul 21 '16 at 05:36
  • It seems sensible, but I can't give you a definite answer. However, follow my line of reasoning. Suppose you have a house of 10 years old. You now predicted the survival probability for 1 year for a 10, 11, ..., 19 year old house (all else equal), and multiplied those probabilities. Suppose we are at time t and the time of death is D; you basically calculated the chances P(D > t + 1 | House_Age). If the house is 19 years old, the probability must be P(D > t + 10 | D > t + 9, House_Age) = (?) P(D > t + 1 | House_Age). Therefore it seems sensible to me; however, I am not sure (see the ?) – StatMan Jul 21 '16 at 07:56
  • @MarcelG - once again thank you very much, it has been very useful for me. I am not sure I understand your formula - I think I would write it like this: P(D>t|House_Age) = P(D>1|House_Age)*P(D>1|House_Age+1)* ... *P(D>1|House_Age+(t-1)). Maybe it is the same as you say. I am new to R and survival models, and I am a bit surprised that it has been hard to find examples with predictions for breakdown where the hazard depends on the age. I would think that this is a common problem, but maybe I searched the wrong places (Cox PH models), or maybe there are more suitable models. – Peter S Jul 22 '16 at 01:18
  • In my formula I assumed we are at point in time _t_. However, in survival analysis we always start at time _0_ (so my bad). Furthermore, I meant the same thing as you said so sorry for being unclear. – StatMan Jul 22 '16 at 07:04
  • To comment on your remark that it's hard to find examples: yes, very true. Examples of prediction with time-varying covariates in Cox PH models are very sparse. I worked with these models during my thesis. It is possible, but very tedious (as you already noticed). If you are considering other models, try ATF (Accelerated Time Failure) models. – StatMan Jul 22 '16 at 07:07
  • Thanks again @MarcelG. Yes, this is a bit more 'messy' than ordinary linear regression; someone should have warned me. I will have a look at the Accelerated Time Failure models - I saw them mentioned somewhere, and it sounds like they might be easier to use than all this multiplication trouble. – Peter S Jul 22 '16 at 08:46
  • Good luck! I just noticed that another error of me. It is an AFT model not ATF ;) – StatMan Jul 22 '16 at 09:15

0 Answers