I have some questions regarding the use of coxph() and predict(Surv()). I am aware that my question is a bit long and that I may not have explained myself well enough, but any comment or suggestion is appreciated.
I am trying to make a Cox PH model and predictions for roof repair on houses. I have 5 input variables (covariates):
House_Age (also called Start), House_Price, Roof_Material_Grp_New, Land_Ownership_Status_Grp, Living_Status_Grp
As the names suggest, the first two are numeric variables and the last three are categorical. My problem is that I want the hazard to depend on House_Age as it changes over time. I chose to split the data at every third year of House_Age (so House_Age turns into the variable Start). For instance, if an Event happens after 7 years, the data would look like
Start  Stop  Event_01_Ts
0      3     0 (Censored)
3      3     0 (Censored)
6      1     1 (Event)
Start is equal to House_Age. Judging from the estimates for each Start group, there seemed to be a linear dependence up to about 40 years, so I chose to cap Start/House_Age at 40 and use a linear relation
Cox_Mod_Lin <- coxph(Surv(Stop,Event_01_Ts) ~ Start+Roof_Material_Grp_New+House_Contract_Yen+Land_Ownership_Status_Grp+Living_Status_Grp,data=Abt_Roof_Ts_Mdl)
The model comes out fine, with a linear coefficient on the Start variable of 0.1916 and an exponentiated value of 1.211
       coef       exp(coef)  se(coef)   z       Pr(>|z|)
Start  1.916e-01  1.211e+00  6.817e-03  28.112  < 2e-16 ***
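As a quick arithmetic check on that row, exp(coef) is simply the exponential of coef, and it is the hazard ratio for a one-unit increase in Start with the other covariates held fixed. The sketch below only redoes that arithmetic in base R; the names beta_start, hr_1yr and hr_3yr are made up for illustration.

```r
# Coefficient for Start, copied from the summary table above
beta_start <- 0.1916

# Hazard ratio for a one-unit (one-year) increase in Start
hr_1yr <- exp(beta_start)

# For a k-unit increase the ratio is exp(k * beta), i.e. hr_1yr^k
hr_3yr <- exp(3 * beta_start)

round(c(one_year = hr_1yr, three_years = hr_3yr), 3)
```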
Looking at Start/House_Age in isolation, the hazard increases by 21% each year - is that correct? My problem is that I now want to make predictions of 'repair' probabilities within 1, 5 and 10 years, for instance. First I try to find the baseline hazard function by using survfit with a zero-vector as input
Base <- survfit(Cox_Mod_Lin,Abt_Baseline,type='aalen')
# Name the columns explicitly so that $Time and $Cumhaz exist below
Base_Time_Hz <- data.frame(Time=Base$time,Cumhaz=Base$cumhaz)
Base_Time_Hz_1yr <- Base_Time_Hz[which(Base_Time_Hz$Time==1),]
Here Abt_Baseline contains zeroes for the numerical variables and the zero-level value for each group variable. From this I find the cumulative hazard at time=1, 5 and 10 (only 1 year is shown) and multiply it by the exponential of the "lp" prediction obtained from the predict function.
One year prediction:
Pred_01<-Base_Time_Hz_1yr$Cumhaz*exp(predict(Cox_Mod_Lin,Abt_Roof_Score, type="lp"))
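One point that may be worth checking: the product computed in Pred_01 is a cumulative hazard H, and under a Cox model the probability of at least one event by that time would be 1 - exp(-H), not H itself. The sketch below illustrates the conversion in base R with made-up numbers; H0_1yr and lp are hypothetical and not taken from the fitted model. As far as I understand, predict() on a coxph fit centers the linear predictor by default, so the baseline and the "lp" values need to share the same reference point; that is an assumption to verify against the survival documentation.

```r
# Hypothetical values, for illustration only (not from the fitted model)
H0_1yr <- 0.02              # baseline cumulative hazard at time = 1
lp     <- c(-0.5, 0, 1.2)   # linear predictors for three houses

# Cumulative hazard per house (the quantity Pred_01 computes)
H_1yr <- H0_1yr * exp(lp)

# Probability of at least one event within one year
p_1yr <- 1 - exp(-H_1yr)
p_1yr
```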
This would be OK if there were no time-dependent inputs, but the hazard varies in the future as a function of the Start (=House_Age) variable. I know the future values of Start (it increases by 1 each year), so I suppose that I could somehow integrate over the prediction period. So I have two main questions:
- Does this seem to be a sensible way of doing the modeling and (some of) the prediction?
- If yes - how do I do the integration over the prediction period with respect to the varying (increasing) House_Age/Start hazard?
Can anybody help me?
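To make the second question concrete, one possible reading of "integrating over the prediction period" is to add up the cumulative hazard piece by piece, using the Start value that applies in each piece, and then convert the total to a probability. The base-R sketch below is only an illustration under stated assumptions: the clock restarts at each split (as in Surv(Stop, ...) above), the house is at the start of a piece when the prediction is made, and Start is capped at 40. The baseline function H0, the value other_lp, and the helper cum_hazard are all hypothetical; only the Start coefficient comes from the model output.

```r
# Hypothetical inputs, for illustration only
beta_start <- 0.1916          # Start coefficient from the model output
H0 <- function(t) 0.01 * t    # made-up baseline cumulative hazard within a piece
other_lp <- 0                 # made-up contribution of the other covariates
max_start <- 40               # Start is capped at 40, as described above

# Cumulative hazard over `horizon` years for a house currently at `age`,
# updating Start every `width` years (width = 1 would update it yearly)
cum_hazard <- function(age, horizon, width = 3) {
  total <- 0
  elapsed <- 0
  while (elapsed < horizon) {
    piece <- min(width, horizon - elapsed)   # length of this piece
    start <- min(age + elapsed, max_start)   # Start value used in this piece
    total <- total + H0(piece) * exp(beta_start * start + other_lp)
    elapsed <- elapsed + piece
  }
  total
}

horizons <- c(1, 5, 10)
H <- sapply(horizons, function(h) cum_hazard(age = 6, horizon = h))
p_repair <- 1 - exp(-H)       # probability of at least one repair
data.frame(horizon = horizons, cum_hazard = H, p_repair = p_repair)
```

With the real model, H0 would be replaced by the cumulative hazard taken from survfit and other_lp by the contribution of the remaining covariates for each house.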