
I'm running an interval regression model in R using the survival package on interval data. I keep getting estimates where the standard error is 0 and the z statistic is infinite (so the p-value is essentially zero) in situations where this just does not pass the sniff test. I've also compared my results to an interval regression on the same data in Stata: some results are the same, and others are different.

I've tried reading into the code to determine what is happening, but I have not been able to figure it out. Are the data breaking some assumption? Is the numerical optimization doing something strange in R? Is this a bug?

library(survival)

data <- structure(list(lb = c(80000, 0, 0, 30000, 30000, 50000, 50000, 
20000, 0, 0, 30000, 30000, 50000, 50000, 20000, 50000, 50000, 
50000, 1e+05, 0, 60000, 130000, 60000, 160000), ub = c(1e+05, 
20000, 20000, 40000, 40000, 60000, 60000, 30000, 20000, 20000, 
40000, 40000, 60000, 60000, 30000, 60000, 60000, 60000, 130000, 
20000, 80000, 160000, 80000, 2e+05), group = c(0, 0, 0, 0, 1, 
1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0)), row.names = c(NA, 
-24L), class = "data.frame")

So this one, using all but the last observation, works fine.

Y <- with(data[1:23,],Surv(lb, ub, event=rep(3, nrow(data[1:23,])), type="interval"))
summary(survreg(Y~group, data=data[1:23,], dist="gaussian", robust=T))

Call:
survreg(formula = Y ~ group, data = data[1:23, ], dist = "gaussian", 
    robust = T)
               Value Std. Err (Naive SE)     z       p
(Intercept) 4.56e+04 9.28e+03   9.42e+03  4.92 8.7e-07
group       5.29e+03 1.36e+04   1.36e+04  0.39     0.7
Log(scale)  1.04e+01 1.87e-01   1.54e-01 55.40 < 2e-16

Scale= 32232 

Gaussian distribution
Loglik(model)= -52.4   Loglik(intercept only)= -52.4
    Chisq= 0.15 on 1 degrees of freedom, p= 0.7 
(Loglikelihood assumes independent observations)
Number of Newton-Raphson Iterations: 2 
n= 23

Now, add on the last observation.

Y <- with(data,Surv(lb, ub, event=rep(3, nrow(data)), type="interval"))
summary(survreg(Y~group, data=data, dist="gaussian", robust=T))

Call:
survreg(formula = Y ~ group, data = data, dist = "gaussian", 
    robust = T)
                Value  Std. Err (Naive SE)     z       p
(Intercept) 55475.198  8221.817   8328.581  6.75 1.5e-11
group       -4364.673     0.000      0.000  -Inf < 2e-16
Log(scale)     10.609     0.202      0.150 52.65 < 2e-16

Scale= 40490 

Gaussian distribution
Loglik(model)= -59   Loglik(intercept only)= -59.1
    Chisq= 0.07 on 1 degrees of freedom, p= 0.79 
(Loglikelihood assumes independent observations)
Number of Newton-Raphson Iterations: 2 
n= 24 

It seems crazy that adding on one observation in a 24-observation dataset would yield a standard error of 0. So I compared the results to an interval regression in Stata. The first interval regression yields identical results in both R and Stata. The second, however, is quite different in Stata.

. intreg lb ub group

Fitting constant-only model:

Iteration 0:   log likelihood = -59.128989  
Iteration 1:   log likelihood = -59.054728  
Iteration 2:   log likelihood = -59.054631  
Iteration 3:   log likelihood = -59.054631  

Fitting full model:

Iteration 0:   log likelihood =  -59.15747  
Iteration 1:   log likelihood = -59.019687  
Iteration 2:   log likelihood = -59.019345  
Iteration 3:   log likelihood = -59.019345  

Interval regression                             Number of obs     =         24
                                                LR chi2(1)        =       0.07
Log likelihood = -59.019345                     Prob > chi2       =     0.7905

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |  -4441.439   16708.64    -0.27   0.790    -37189.77    28306.89
       _cons |   55510.53   11336.39     4.90   0.000     33291.61    77729.45
-------------+----------------------------------------------------------------
    /lnsigma |   10.60881   .1501972    70.63   0.000     10.31443    10.90319
-------------+----------------------------------------------------------------
       sigma |   40490.08   6081.497                       30164.8    54349.64
------------------------------------------------------------------------------
             0  left-censored observations
             0     uncensored observations
             0 right-censored observations
            24       interval observations

Any thoughts would be appreciated.

natecon
  • @IRTFM But this is an interval regression, which is not scale-invariant and where the interpretation of coefficients is more similar to OLS (a one-unit increase in the independent variable leads to a beta increase in the dependent variable). You can see this yourself by just dividing the upper and lower bounds in my data by 1000 and you get coefficients 1000 times smaller as well. I am using the survival package, but am not running a typical survival analysis. – natecon Aug 02 '21 at 13:49
  • Got it. Deleted my noise. I first reordered the values to see if might be something unique about that row, but got no error. Then I substituted the `Surv`-code for the Y name in the formula. I then kept trying to reproduce the error and kept failing. See below. – IRTFM Aug 02 '21 at 21:18

2 Answers


I emailed the package author for some help with this. He was unsure of the exact reason for this behavior, but confirmed there is some sort of bug.

He suggested a simple temporary workaround: scale all the values down so they lie roughly between 0 and 200. The problem is indeed solved by dividing both bounds by 1000 and interpreting the coefficients in thousands of the original units.
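For reference, a minimal sketch of that workaround, assuming the `data` object from the question is in scope (the `lb_k`/`ub_k` column names are just illustrative):

```r
library(survival)

# Divide both bounds by 1000 so the values lie in a small range (roughly 0-200).
# Coefficients and the scale are then interpreted in thousands of the
# original units.
data$lb_k <- data$lb / 1000
data$ub_k <- data$ub / 1000

fit <- survreg(Surv(lb_k, ub_k, event = rep(3, nrow(data)), type = "interval") ~ group,
               data = data, dist = "gaussian", robust = TRUE)
summary(fit)
```

With the rescaled bounds, the fit on all 24 observations no longer collapses to a zero standard error for `group`.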

natecon

When I run:

 summary(survreg(Surv(lb, ub, event=rep(3, nrow(data[1:24,])), type="interval")~group, 
                data=data, dist="gaussian", robust=T))

... I get no error. This is not the first time I have encountered obscure errors that were resolved by not constructing the Surv object in the global environment, and instead putting it in the formula so that it gets pulled from the data object passed to the survival function. However, when I then went back and attempted to replicate your error, I was unsuccessful, so we now need further details to see why your experience is different from mine. Here's my sessionInfo():

Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] survival_3.2-11

loaded via a namespace (and not attached):
[1] compiler_4.0.4  Matrix_1.3-4    tools_4.0.4     splines_4.0.4   grid_4.0.4      lattice_0.20-44

I hope it's not the case, but if you run a fresh session with a more up-to-date R and a matching survival package and still reproduce your error, you would then need to send the example to Terry Therneau as a bug report.

IRTFM
  • Thank you for your help. Your solution still did not work for me. I had emailed Terry Therneau earlier and received an answer last night that this likely is a bug. I will post his response as an answer. – natecon Aug 03 '21 at 15:05