Imputation for longitudinal data using observation before and after missing data

Question

I’m in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I’m wondering how I can go about addressing the issues detailed below.

I’ve been trying to break the problem into smaller, more manageable operations and objects. However, the solutions I keep arriving at force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can experiment with, or of any good search terms I can use when looking up a solution.

The details are below:

#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

In the target outputs below, the imputed values are the only changes from the dataset above.

The goal here is to find a way to get the mean of the value before (3) and after (0) the NA value for ID #1 (variable ss) so that the data look like this:
1,3,2,3,1.5,0,0

ID #2 (variable ss) should look like this:
2,4,0,0,0,0,0

ID #3 (variable ss) should use a last observation carried forward approach, so it would need to look like this:
4,1,2,4,2,3,3

ID #4 (variable ss) has two consecutive NA values and should not be changed. It will be flagged for a different analysis later in my project. So, it should look like this:
2,1,0,NA,NA,0,0 (no change).

greengrass62 · Accepted Answer · 2016-05-12T01:53:31.527

I use the smwrBase package. The syntax for filling in only a single missing value is below, but it doesn't account for id.

smwrBase::fillMissing(ss, max.fill=1)

The zoo package might be more standard, though it has the same issue.

zoo::na.approx(ss, maxgap=1)
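
To make the "same issue" concrete, here is a small sketch, assuming the zoo package is installed and reusing the ss vector from the question. It runs na.approx over the whole column at once, so the interpolation is free to cross from one id into the next.

```r
library(zoo)

# ss from the question: four ids, seven time points each
ss <- c(1,3,2,3,NA,0,0, 2,4,0,NA,0,0,0, 4,1,2,4,2,3,NA, 2,1,0,NA,NA,0,0)

# maxgap = 1 fills only isolated NAs; na.rm = FALSE keeps the original length
filled <- zoo::na.approx(ss, maxgap = 1, na.rm = FALSE)

filled[21]     # last value of id 3, interpolated between 3 (id 3) and 2 (id 4)
filled[25:26]  # two consecutive NAs in id 4, left untouched
```

Position 21 comes back as 2.5 because the value after it belongs to id 4, whereas the question asks for 3 (the last observation carried forward within id 3). The run of two NAs in id 4 stays missing, which is the desired behaviour.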

Below is an approach that accounts for the variable id. Interpolation functions generally don't fill in a trailing missing value, so I added a manual if statement for that. It's a bit brute force; there might be a tapply approach out there.

> id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
> time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
> ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
> mydat <- data.frame(id, time, ss, ss2=NA_real_)
> for (i in unique(id)) {
+   # interpolate for gaps
+   mydat$ss2[mydat$id==i] <- zoo::na.approx(ss[mydat$id==i], maxgap=1, na.rm=FALSE)
+   # extension for gap as last value
+   if(is.na(mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])])) {
+     mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])] <-
+       mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])-1]
+   }
+ }
> mydat
   id time ss ss2
1   1    0  1 1.0
2   1    1  3 3.0
3   1    2  2 2.0
4   1    3  3 3.0
5   1    4 NA 1.5
6   1    5  0 0.0
7   1    6  0 0.0
8   2    0  2 2.0
9   2    1  4 4.0
10  2    2  0 0.0
11  2    3 NA 0.0
12  2    4  0 0.0
13  2    5  0 0.0
14  2    6  0 0.0
15  3    0  4 4.0
16  3    1  1 1.0
17  3    2  2 2.0
18  3    3  4 4.0
19  3    4  2 2.0
20  3    5  3 3.0
21  3    6 NA 3.0
22  4    0  2 2.0
23  4    1  1 1.0
24  4    2  0 0.0
25  4    3 NA  NA
26  4    4 NA  NA
27  4    5  0 0.0
28  4    6  0 0.0

The interpolated value in id=1 is 1.5 (avg of 3 and 0), in id=2 it is 0 (avg of 0 and 0), and in id=3 it is 3 (the preceding value, since there is no following value).
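
The loop above can probably be collapsed into a grouped call. The following is a minimal sketch rather than a tested replacement: it assumes the zoo package is installed, uses base R's ave() to apply a function within each id, and introduces a hypothetical helper name, fill_one_gap, that is not part of any package.

```r
library(zoo)

id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1,3,2,3,NA,0,0, 2,4,0,NA,0,0,0, 4,1,2,4,2,3,NA, 2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# fill_one_gap is a hypothetical helper (name chosen for illustration only)
fill_one_gap <- function(x) {
  # interpolate isolated NAs; runs of two or more NAs are left alone
  out <- zoo::na.approx(x, maxgap = 1, na.rm = FALSE)
  n <- length(out)
  # carry the previous value forward if only the final value is still missing
  if (n > 1 && is.na(out[n]) && !is.na(out[n - 1])) {
    out[n] <- out[n - 1]
  }
  out
}

# ave() splits ss by id, applies the helper, and returns a vector in row order
mydat$ss2 <- ave(mydat$ss, mydat$id, FUN = fill_one_gap)
mydat[mydat$id == 3, ]
```

Because ave() returns a vector aligned with the original rows, no explicit subsetting by id is needed. An id whose series starts with NA would keep that leading NA, a case the sample data does not contain.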

Jonah M. ... I didn't understand/realize the significance of id in your sample data, so my above solution is only partly helpful. — greengrass62, May 12 '16 at 00:54
Thanks, Greengrass, I really appreciate your help. Unfortunately, this isn't quite what I'm looking for. The issue is that I'm not simply trying to interpolate missing values. I need to be able to interpolate based on the value immediately preceding it and immediately following it. Essentially, I need each missing value to be an average of the value that preceded it and the value that followed it. See example ID 1 for a good illustration of this. — Jonah M., May 12 '16 at 01:37
Jonah, I looked at id 1 and I get a value of 1.5, which is what I thought was called for. What am I missing? — greengrass62, May 12 '16 at 01:54
Greengrass, this is awesome. I think my error was in adapting it to my real data, I had some issues that were related to typos. This works like a charm. Thank you! — Jonah M., May 12 '16 at 11:29
