I'm doing a survival analysis about the time some individual components remain in the source code of a software project, but some of these components are being dropped by the survfit
function.
This is what I'm doing:
library(survival)
data <- read.table(text = "component_id weeks removed
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 2 0
9 2 0
10 2 0
11 2 0
12 2 1
13 2 1
14 2 0
15 2 0
16 2 0
17 2 0
18 2 0
19 2 0
20 2 1
21 2 1
22 2 0
23 2 0
24 3 1
25 3 1
26 3 1
27 3 1
28 7 1
29 7 1
30 14 1
31 14 1
32 14 1
33 14 1
34 14 1
35 14 1
36 14 1
37 14 1
38 14 1
39 14 1
40 14 1
41 14 1
42 14 1
43 14 1
44 14 1
45 14 1
46 14 1
47 14 1
48 40 1
49 40 1
50 40 1
51 40 1
52 48 1
53 48 1
54 48 1
55 48 1
56 48 1
57 48 1
58 48 1
59 48 1
60 56 1
61 56 1
62 56 1
63 56 1
64 56 1
65 56 1
66 56 1
67 56 1
68 56 1
69 56 1", header = TRUE)
fit <- survfit(Surv(data$weeks, data$removed) ~ 1)
summary(fit, censored=TRUE)
And this is the output
Call: survfit(formula = Surv(data$weeks, data$removed) ~ 1)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 69 7 0.899 0.0363 0.830 0.973
2 62 4 0.841 0.0441 0.758 0.932
3 46 4 0.767 0.0533 0.670 0.879
7 42 2 0.731 0.0567 0.628 0.851
14 40 18 0.402 0.0654 0.292 0.553
40 22 4 0.329 0.0629 0.226 0.478
48 18 8 0.183 0.0520 0.105 0.319
56 10 10 0.000 NaN NA NA
I was expecting the number of events to be 69 but I get 12 subjects dropped.
I initially thought I was misusing the package functions, and carried a type="interval2"
approach, following a similar situation, but the drops keep happening with now a weird continuous number of subjects and events counts:
as.t2 <- function(i, data) if (data$removed[i] == 1) data$weeks[i] else NA
size <- length(data$weeks)
t1 <- data$weeks
t2 <- sapply(1:size, as.t2, data = data)
interval_fit <- survfit(Surv(t1, t2, type="interval2") ~ 1)
summary(interval_fit, censored=TRUE)
Next, I found what I call a mid-air explanation, clarifying a bit further the situation. I understand this is caused by non-censored subjects appearing after a "constant censoring time", but again, why?
That led me somehow to dig deeper and read about right-truncation and realized that type of studies mapped very closely to the drops I'm experiencing. Here's Klein & Moeschberger:
Truncation of survival data occurs when only those individuals whose event time lies within a certain observational window $(Y_L,Y_R)$ are observed. An individual whose event time is not in this interval is not observed and no information on this subject is available to the investigator.
Right truncation occurs when $Y_L$ is equal to zero. That is, we observe the survival time $X$ only when $X \leq Y_R$.
From my perspective, these drops carry important information for my study regardless of their time of entry.
How can I stop the drops?