Data Cleaning for Survival Analysis

Question

I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that an individual only has a single, sustained, transition from symptom present (ss=1) to symptom remitted (ss=0). An individual must have a complete sustained remission in order for it to count as a remission. Statistical problems/issues aside, I’m wondering how I can go about addressing the issues detailed below.

I’ve been trying to break the problem into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.

The details are below:

# Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <- c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0

ID# 2 (variable ss) to look like this: 1,1,0,0,0,0,0

ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)

ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).

Is there an operational definition of _sustained_ in terms of time rather than later evidence of recurrence? Will we be excluding from the risk set the time under observation of cases whose remission has lasted longer than that time? — IRTFM, Apr 19 '16 at 04:13
Sustained in this case means that an individual ostensibly was symptom free (ss=0) through the last time-point. Missing data, of course, throws a wrench into the gears, but that aside for now, I'm interested in developing code to accomplish the task outlined above. — Jonah M., Apr 19 '16 at 04:16
I'm assuming this is in the analysis of treated cancer with probable fatal (or expensive) outcome for recurrence. I would drop the term "sustained" since "survival" is so confounded with the time under observation and time at t= 1 is being considered on the same basis as time at t=10. Instead refer to recurrence-free survival. It's already a sufficiently confusing statistic since an event of principal importance (death from other causes) has been recast as a censoring process. — IRTFM, Apr 19 '16 at 04:34

score 0 · Accepted Answer · answered Apr 19 '16 at 04:51

I don't think you have considered all the edge cases: what should happen with two NAs in a row at the end of a period, or with four or five NAs in a row? That said, this will give you the requested solution in your tiny test case, using the na.locf function from the zoo package:

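library(zoo)  # provides na.locf()
# Leave a series untouched if its last value is NA;
# otherwise carry the last observation forward over the NAs.
fillNA <- function(vec) {
  # na.rm = FALSE keeps any leading NAs so the length ave() expects is unchanged
  if (is.na(tail(vec, 1))) vec else na.locf(vec, na.rm = FALSE)
}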

> mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
> mydat
   id time ss locf
1   1    0  1    1
2   1    1  1    1
3   1    2  1    1
4   1    3  1    1
5   1    4 NA    1
6   1    5  0    0
7   1    6  0    0
8   2    0  1    1
9   2    1  1    1
10  2    2  0    0
11  2    3 NA    0
12  2    4  0    0
13  2    5  0    0
14  2    6  0    0
15  3    0  1    1
16  3    1  1    1
17  3    2  1    1
18  3    3  1    1
19  3    4  1    1
20  3    5  1    1
21  3    6 NA   NA
22  4    0  1    1
23  4    1  1    1
24  4    2  0    0
25  4    3 NA    0
26  4    4 NA    0
27  4    5  0    0
28  4    6  0    0
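
One of the edge cases mentioned above is a series that ends in NA but also has NAs in the middle; fillNA leaves such a series completely untouched. If you would rather fill the interior gaps and leave only the leading and trailing NAs alone, something like the following base-R sketch might do it. The name fill_interior_na is hypothetical, it is not part of zoo, and it assumes that carrying the last observed value forward is the rule you want between observed values:

```r
# Hypothetical helper (an assumption, not from zoo): carry the last observed
# value forward across interior NAs, leaving leading and trailing NAs as they are.
fill_interior_na <- function(vec) {
  obs <- which(!is.na(vec))
  if (length(obs) == 0) return(vec)            # nothing observed, nothing to carry
  span <- seq(obs[1], obs[length(obs)])        # first through last observed position
  # For each position in the span, the index of the most recent observed value
  last_obs <- cummax(ifelse(is.na(vec[span]), 0L, span))
  vec[span] <- vec[last_obs]
  vec
}

x <- c(1, 1, 0, NA, NA, 0, NA, NA)
fill_interior_na(x)
# [1]  1  1  0  0  0  0 NA NA
```

It should drop into the same ave() call as before, e.g. with(mydat, ave(ss, id, FUN = fill_interior_na)). Whether trailing NAs ought to stay missing is a judgment call for your analysis, not something the code can decide.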

Thank you for your answer. It solves, exactly, the problem I'd been struggling with. — Jonah M., Apr 19 '16 at 11:12
You've been incredibly helpful in the past, and so I wondered if you might have any insights on the following (slightly) modified question: http://stackoverflow.com/questions/41251553/data-cleaning-for-survival-analysis-using-a-participants-own-data-to-impute-val — Jonah M., Dec 20 '16 at 21:31
