I'm running an interval regression model in R using the survival package on interval-censored data. I keep getting fits where the robust standard error is 0 and the z statistic is infinite, in situations where that just does not pass the sniff test. I've also compared my results to an interval regression on the same data in Stata: some results agree, others do not.
I've tried reading through the survreg code to work out what is happening, but I have not been able to figure it out. Are the data breaking some assumption? Is the numerical optimization in R doing something strange? Is this a bug?
library(survival)
data <- data.frame(
  lb = c(80000, 0, 0, 30000, 30000, 50000, 50000, 20000, 0, 0, 30000, 30000,
         50000, 50000, 20000, 50000, 50000, 50000, 1e+05, 0, 60000, 130000,
         60000, 160000),
  ub = c(1e+05, 20000, 20000, 40000, 40000, 60000, 60000, 30000, 20000, 20000,
         40000, 40000, 60000, 60000, 30000, 60000, 60000, 60000, 130000, 20000,
         80000, 160000, 80000, 2e+05),
  group = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1,
            1, 0)
)
So this one, using all but the last observation, works fine.
# event = 3 flags every observation as interval-censored
Y <- with(data[1:23, ], Surv(lb, ub, event = rep(3, nrow(data[1:23, ])), type = "interval"))
summary(survreg(Y ~ group, data = data[1:23, ], dist = "gaussian", robust = T))
Call:
survreg(formula = Y ~ group, data = data[1:23, ], dist = "gaussian",
robust = T)
               Value Std. Err (Naive SE)     z       p
(Intercept) 4.56e+04 9.28e+03   9.42e+03  4.92 8.7e-07
group       5.29e+03 1.36e+04   1.36e+04  0.39     0.7
Log(scale)  1.04e+01 1.87e-01   1.54e-01 55.40 < 2e-16
Scale= 32232
Gaussian distribution
Loglik(model)= -52.4 Loglik(intercept only)= -52.4
Chisq= 0.15 on 1 degrees of freedom, p= 0.7
(Loglikelihood assumes independent observations)
Number of Newton-Raphson Iterations: 2
n= 23
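For reference, this is how I pull the two sets of standard errors out of the fitted object; if I'm reading the survreg object components correctly, fit$var holds the robust (sandwich) variance when robust=T and fit$naive.var holds the model-based one:
fit23 <- survreg(Y ~ group, data = data[1:23, ], dist = "gaussian", robust = T)
sqrt(diag(fit23$var))        # robust SEs (should match the Std. Err column above)
sqrt(diag(fit23$naive.var))  # model-based SEs (should match the (Naive SE) column)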
Now add the last observation back in.
Y <- with(data, Surv(lb, ub, event = rep(3, nrow(data)), type = "interval"))
summary(survreg(Y ~ group, data = data, dist = "gaussian", robust = T))
Call:
survreg(formula = Y ~ group, data = data, dist = "gaussian",
robust = T)
                Value Std. Err (Naive SE)     z       p
(Intercept) 55475.198 8221.817   8328.581  6.75 1.5e-11
group       -4364.673    0.000      0.000  -Inf < 2e-16
Log(scale)     10.609    0.202      0.150 52.65 < 2e-16
Scale= 40490
Gaussian distribution
Loglik(model)= -59 Loglik(intercept only)= -59.1
Chisq= 0.07 on 1 degrees of freedom, p= 0.79
(Loglikelihood assumes independent observations)
Number of Newton-Raphson Iterations: 2
n= 24
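To see where the zero comes from, my understanding from the survreg documentation is that the robust variance is a sandwich (infinitesimal jackknife) estimate built from the dfbeta residuals, so their crossproduct should reproduce the variance reported above. This is the check I have in mind (the fit24$var component and the crossprod() identity are my reading of the docs, so treat it as a sketch):
fit24 <- survreg(Y ~ group, data = data, dist = "gaussian", robust = T)
D <- residuals(fit24, type = "dfbeta")  # approximate change in each parameter if observation i is dropped
crossprod(D)                            # should reproduce fit24$var, the robust variance behind the SEs above
fit24$var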
It seems crazy that adding a single observation to a 24-observation dataset would drive a standard error to exactly 0 (the naive SE for group collapses to 0 as well). So I compared the results to an interval regression on the same data in Stata. The first regression, on 23 observations, yields identical results in R and Stata; the second, on all 24, comes out quite different in Stata.
. intreg lb ub group
Fitting constant-only model:
Iteration 0: log likelihood = -59.128989
Iteration 1: log likelihood = -59.054728
Iteration 2: log likelihood = -59.054631
Iteration 3: log likelihood = -59.054631
Fitting full model:
Iteration 0: log likelihood = -59.15747
Iteration 1: log likelihood = -59.019687
Iteration 2: log likelihood = -59.019345
Iteration 3: log likelihood = -59.019345
Interval regression                             Number of obs   =         24
                                                LR chi2(1)      =       0.07
Log likelihood = -59.019345                     Prob > chi2     =     0.7905
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |  -4441.439   16708.64    -0.27   0.790    -37189.77    28306.89
       _cons |   55510.53   11336.39     4.90   0.000     33291.61    77729.45
-------------+----------------------------------------------------------------
    /lnsigma |   10.60881   .1501972    70.63   0.000     10.31443    10.90319
-------------+----------------------------------------------------------------
       sigma |   40490.08   6081.497                       30164.8    54349.64
------------------------------------------------------------------------------
0 left-censored observations
0 uncensored observations
0 right-censored observations
24 interval observations
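One other thing I notice is that R reports only 2 Newton-Raphson iterations while Stata iterates a few more times and lands on slightly different coefficients, so maybe this is a convergence issue. If it is relevant, this is the variant I would try next, with a tighter tolerance and a higher iteration cap (maxiter and rel.tolerance are documented survreg.control() arguments; I have no idea yet whether it changes anything):
summary(survreg(Y ~ group, data = data, dist = "gaussian", robust = T,
                control = survreg.control(maxiter = 100, rel.tolerance = 1e-10)))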
Any thoughts would be appreciated.