
I have data generated by intermittent interviews in which an individual is asked whether they are experiencing a certain symptom. The last time each individual was known not to have the symptom is denoted tstart. If applicable, the time at which the individual is observed to have developed the symptom is tstop. Using the survival package in R, a survival object is created with the Surv function, specifying that the data are interval censored. I would like a non-parametric maximum likelihood estimate of the survival function. This can be accomplished with the survfit function, which appears to pass the call to an internal function, survfitTurnbull. The resulting confidence intervals are implausibly wide, and I am unable to figure out why.

# A random sample of the data using dput()
temp_df <- structure(list(tstart = c(0.01, 38, 0.01, 0.01, 23, 26, 0.01, 
19, 0.01, 0.01, 22, 6, 0.01, 14, 16, 0.01, 0.01, 0.01, 0.01, 
21, 15, 0.01, 0.01, 13, 10, 0.01, 0.01, 19, 0.01, 0.01, 0.01, 
0.01, 22, 17, 27, 14, 16, 0.01, 20, 27, 10, 0.01, 0.01, 16, 20, 
7, 6, 15, 0.01, 0.01), tstop = c(4.01, NA, 5.01, 8.01, NA, NA, 
5.01, NA, 3.01, 16.01, NA, 6.01, 8.01, NA, NA, 7.01, 16.01, 1.01, 
10.01, NA, NA, 5.01, 8.01, NA, NA, 2.01, 3.01, NA, 7.01, 5.01, 
2.01, 9.01, NA, NA, NA, NA, NA, 10.01, NA, NA, NA, 5.01, 10.01, 
NA, NA, NA, 7.01, NA, 14.01, 4.01)), row.names = c(NA, -50L), class = "data.frame")

survObj <- with(temp_df, Surv(time = tstart, time2 = tstop, type = "interval2"))
survFit <- survfit(survObj ~ 1)
summary(survFit)

The confidence interval does not narrow over time, and it is no narrower using the whole dataset (which contains approximately 10 times the number of events). I am unable to figure out what is going wrong.

user6571411

2 Answers


For what it's worth, this does not look like a bug in the software, but rather a potential limitation of using something as flexible as the non-parametric maximum likelihood estimator (NPMLE, also known as the Turnbull estimator, which survfit is fitting if you give it interval censored data) for estimating a survival curve. The TLDR version of this answer is that I suggest you use a parametric model such as Weibull, either using survival::survreg, icenReg::ic_par or icenReg::ic_bayes. Admission of bias: I'm the author of icenReg.
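As a minimal sketch of the parametric route (assuming `temp_df` is the data frame from the question, with NA in tstop meaning right-censored), a Weibull model can be fit directly to the interval-censored Surv object with survival::survreg:

```r
library(survival)

# Assumes temp_df is the question's data frame; NA tstop = right-censored
survObj <- with(temp_df, Surv(time = tstart, time2 = tstop, type = "interval2"))
wbFit <- survreg(survObj ~ 1, dist = "weibull")
summary(wbFit)

# survreg uses an AFT parameterization:
#   S(t) = exp(-(t / scale)^shape), scale = exp(intercept), shape = 1 / wbFit$scale
times <- seq(0.01, 40, length.out = 200)
shape <- 1 / wbFit$scale
scale <- exp(coef(wbFit))
plot(times, exp(-(times / scale)^shape), type = "l",
     xlab = "t", ylab = "S(t)", main = "Weibull fit")
```

icenReg::ic_par (or ic_bayes for a Bayesian fit) accepts a similar one-sided formula and would give a comparable smooth curve with interval estimates.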

A somewhat technical but very relevant note about the NPMLE is that it only assigns positive probability mass to Turnbull intervals: intervals whose left side is the left side of some observation interval and whose right side is the nearest right side of any observation interval at or after it. To illustrate, I've plotted your observation intervals and the corresponding Turnbull intervals.

[Plot: the observation intervals and the corresponding Turnbull intervals]

Note that there is a huge gap between the last two Turnbull intervals! This leads to a very "jumpy" NPMLE, which also leads to quite a bit of error in-between the jumps.
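The characterization above can be sketched in code. This is a hypothetical helper (not part of survival or icenReg): a pair (q, p) is a Turnbull interval when q is a left endpoint of some observation interval, p is the nearest right endpoint at or after q, and no other endpoint falls strictly between them.

```r
# Hypothetical helper: locate Turnbull intervals from observation intervals.
# Right-censored observations (NA in `right`) are treated as (left, Inf).
turnbull_intervals <- function(left, right) {
  right[is.na(right)] <- Inf
  L <- sort(unique(left))
  R <- sort(unique(right))
  found <- list()
  for (q in L) {
    p <- min(R[R >= q])            # nearest right endpoint at or after q
    if (!any(L > q & L < p)) {     # no other left endpoint strictly inside (q, p)
      found[[length(found) + 1]] <- c(q, p)
    }
  }
  do.call(rbind, found)
}

# With the question's sample (temp_df), the largest finite right endpoint is
# 16.01 while one left endpoint is 38, so the last two Turnbull intervals
# are roughly [16, 16.01] and [38, Inf) -- the large gap described above.
turnbull_intervals(temp_df$tstart, temp_df$tstop)
```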

After having spent a long time thinking about this issue, my quick summary is that this is a consequence of having only mildly informative data and too much flexibility. In most survival analysis cases, it is reasonable to assume a smooth survival curve, such as a parametric distribution. As long as the distribution is not too overly restrictive (read: the one parameter exponential distribution), this mild assumption of smoothness allows you to gain much more information out of your data without introducing too much bias.

To illustrate, I've attached a plot of a Weibull fit + confidence intervals and the fitted NPMLE next to it.

[Plot: Weibull fit with confidence intervals alongside the fitted NPMLE]

FYI, the box that you see with the NPMLE is not a confidence interval. Rather, the NPMLE is unique only up to the probability mass assigned to each Turnbull interval; how that mass is distributed within the interval does not affect the log-likelihood. So any survival curve that passes through the box maximizes the log-likelihood.
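For reference, a plot like the one above can be produced with icenReg's NPMLE fitter (a sketch, assuming `temp_df` from the question; note that icenReg encodes right censoring as an Inf upper bound rather than NA):

```r
library(icenReg)

# icenReg expects (lower, upper) response columns, with Inf marking
# right-censored observations rather than NA
temp_df$tstop[is.na(temp_df$tstop)] <- Inf
npFit <- ic_np(cbind(tstart, tstop) ~ 0, data = temp_df)
plot(npFit)   # draws the Turnbull intervals as boxes
```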

Cliff AB

I also found that the default (NPMLE) confidence intervals from survfit were surprisingly wide with interval censored data, even using data sets with >50,000 observations. They showed counter-intuitive behaviour in relation to sample size (e.g. strata with many more observations sometimes had wider CIs than strata with one-tenth the number). I'm no mathematician, but using method = "rothman" or method = "loglog" from km.ci::km.ci produced narrower confidence limits that responded to changes in sample size in a manner more in line with what I would expect. This article by Sachs et al. (2022) also seems pertinent: https://www.nature.com/articles/s41416-022-01920-5
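A sketch of what that looks like, assuming `survFit` is the survfit object from the question (whether km.ci's methods are strictly appropriate for an interval-censored fit is a separate question):

```r
library(km.ci)

# Recompute the confidence limits of an existing survfit object
# with alternative methods
fit_rothman <- km.ci(survFit, conf.level = 0.95, method = "rothman")
fit_loglog  <- km.ci(survFit, conf.level = 0.95, method = "loglog")
plot(fit_rothman)
```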

S_Hartley