Imputation for bounding NA observations using a linear approximation

Question

I would like to impute values for NA observations at the beginning of the array, using a linear approximation of the following two non-NA observations to extrapolate the missing value. Then do the same for the NA observations at the end of the array, using the preceding two non-NA observations.

A reproducible example of my df:

M=matrix(sample(1:9,10*10,T),10);M[sample(1:length(M),0.5*length(M),F)]=NA;dimnames(M)=list(paste(rep("City",dim(M)[1]),1:dim(M)[1],sep=""),paste(rep("Year",dim(M)[2]),1:dim(M)[2],sep=""))
    M

       Year1 Year2 Year3 Year4 Year5 Year6 Year7 Year8 Year9 Year10
City1     NA     4     5    NA     3    NA    NA    NA     5     NA
City2      6    NA     3     3    NA     4     6    NA    NA      7
City3     NA     7    NA     8     8    NA    NA     8    NA      5
City4      3     5     3    NA    NA     3     5     9     8      7
City5      4     6     6    NA    NA     8    NA     7     1     NA
City6     NA    NA    NA    NA     4    NA     8     3     6      7
City7      9     3    NA    NA    NA    NA    NA     4    NA     NA
City8      5     6     9     8     5    NA    NA     1     4     NA
City9     NA    NA     6    NA     3     3     8    NA     7     NA
City10    NA    NA    NA    NA    NA    NA    NA    NA    NA      1

idx=rowSums(!is.na(M))>=2 # Index of rows with 2 or more non-NA to run na.approx

library(zoo)
M[idx,]=t(na.approx(t(M[idx,]),rule=1,method="linear")) # I'm using t as na.approx works on columns

       Year1 Year2 Year3 Year4    Year5 Year6 Year7 Year8 Year9 Year10
City1     NA   4.0     5   4.0 3.000000  3.50   4.0   4.5     5     NA
City2    6.0   5.5     3   3.0 5.500000  4.00   6.0   6.0     6      7
City3    4.5   7.0     3   8.0 8.000000  3.50   5.5   8.0     7      5
City4    3.0   5.0     3   8.0 6.666667  3.00   5.0   9.0     8      7
City5    4.0   6.0     6   8.0 5.333333  8.00   6.5   7.0     1      7
City6    6.5   4.5     7   8.0 4.000000  6.75   8.0   3.0     6      7
City7    9.0   3.0     8   8.0 4.500000  5.50   8.0   4.0     5     NA
City8    5.0   6.0     9   8.0 5.000000  4.25   8.0   1.0     4     NA
City9     NA    NA     6   4.5 3.000000  3.00   8.0   7.5     7     NA
City10    NA    NA    NA    NA       NA    NA    NA    NA    NA      1

I would like to extrapolate the boundaries (for City1 and City9) using a linear approximation based on the two preceding/following observations. For example M[1,1] should be 3 and M[1,10] should be 5,5.

Do you know how I could do this?

G. Grothendieck · Answer 1 · 2015-05-11T18:57:10.260

In extrap, nlead is the number of leading NAs in input vector x. non.na is the subset of elements of x which are not NA. Return the input if there are no leading NA elements or if there are fewer than 2 non-NA elements. m is the slope of the first two non-NAs. Replace the first nlead elements of x with the extrapolation. Finally we apply extrap to each row of M using MM[] <- so column names are preserved and then reverse each row, repeat and reverse back:

library(zoo)

extrap <- function(x) {
    nlead <- which.min(x * 0) - 1
    non.na <- na.omit(x)
    if (length(nlead) == 0 || nlead == 0) || length(non.na) < 2) return(x)
    m <- diff(head(non.na, 2))      
    replace(x, seq_len(nlead), non.na[1] - nlead:1 * m)
}

nc <- ncol(M)

naApprox <- function(x) if (length(na.omit(x)) < 2) x else na.approx(x, na.rm = FALSE)
MM <- M
MM[] <- t(apply(MM, 1, naApprox))

MM[] <- t(apply(MM, 1, extrap)) # extraploate to fill leading NAs
MM[] <- t(apply(MM[, nc:1], 1, extrap))[, nc:1] # extrapolate to fill trailing NAs

giving:

> MM
       Year1 Year2    Year3    Year4    Year5    Year6    Year7    Year8    Year9    Year10
City1    3.0   4.0 5.000000 4.000000 3.000000 3.500000 4.000000 4.500000 5.000000  5.500000
City2    6.0   4.5 3.000000 3.000000 3.500000 4.000000 6.000000 6.333333 6.666667  7.000000
City3    6.5   7.0 7.500000 8.000000 8.000000 8.000000 8.000000 8.000000 6.500000  5.000000
City4    3.0   5.0 3.000000 3.000000 3.000000 3.000000 5.000000 9.000000 8.000000  7.000000
City5    4.0   6.0 6.000000 6.666667 7.333333 8.000000 7.500000 7.000000 1.000000 -5.000000
City6   -4.0  -2.0 0.000000 2.000000 4.000000 6.000000 8.000000 3.000000 6.000000  7.000000
City7    9.0   3.0 3.166667 3.333333 3.500000 3.666667 3.833333 4.000000 4.166667  4.333333
City8    5.0   6.0 9.000000 8.000000 5.000000 3.666667 2.333333 1.000000 4.000000  7.000000
City9    9.0   7.5 6.000000 4.500000 3.000000 3.000000 8.000000 7.500000 7.000000  6.500000
City10    NA    NA       NA       NA       NA       NA       NA       NA       NA  1.000000

Note We used this as M:

M <- structure(c(NA, 6L, NA, 3L, 4L, NA, 9L, 5L, NA, NA, 4L, NA, 7L, 
5L, 6L, NA, 3L, 6L, NA, NA, 5L, 3L, NA, 3L, 6L, NA, NA, 9L, 6L, 
NA, NA, 3L, 8L, NA, NA, NA, NA, 8L, NA, NA, 3L, NA, 8L, NA, NA, 
4L, NA, 5L, 3L, NA, NA, 4L, NA, 3L, 8L, NA, NA, NA, 3L, NA, NA, 
6L, NA, 5L, NA, 8L, NA, NA, 8L, NA, NA, NA, 8L, 9L, 7L, 3L, 4L, 
1L, NA, NA, 5L, NA, NA, 8L, 1L, 6L, NA, 4L, 7L, NA, NA, 7L, 5L, 
7L, NA, 7L, NA, NA, NA, 1L), .Dim = c(10L, 10L), .Dimnames = list(
    c("City1", "City2", "City3", "City4", "City5", "City6", "City7", 
    "City8", "City9", "City10"), c("Year1", "Year2", "Year3", 
    "Year4", "Year5", "Year6", "Year7", "Year8", "Year9", "Year10"
    )))

Update: Fixed.

Your program runs smoothly and does the job perfectly! – goclem May 12 '15 at 08:19 — goclem, May 12 '15 at 08:19

score 1 · Accepted Answer · answered May 11 '15 at 13:22

This gives you the first column with the linearly-extrapolated values filled in for NA. You can adapt for the last column.

firstNAfill <- function(x) {
  ans <- ifelse(!is.na(x[1]),
                x[1],
                ifelse(sum(!is.na(x))<2, NA,
                       2*x[which(!is.na(x[1, ]))[1]] - x[which(!is.na(x[1, ]))[2]]
                )
  )
  return(ans)
}


dat$Year1 <- unlist(lapply(seq(1:nrow(dat)), function(x) {firstNAfill(dat[x, ])}))

Result:

       Year1 Year2 Year3 Year4    Year5 Year6 Year7 Year8 Year9 Year10
City1    3.0   4.0     5   4.0 3.000000  3.50   4.0   4.5     5     NA
City2    6.0   5.5     3   3.0 5.500000  4.00   6.0   6.0     6      7
City3    4.5   7.0     3   8.0 8.000000  3.50   5.5   8.0     7      5
City4    3.0   5.0     3   8.0 6.666667  3.00   5.0   9.0     8      7
City5    4.0   6.0     6   8.0 5.333333  8.00   6.5   7.0     1      7
City6    6.5   4.5     7   8.0 4.000000  6.75   8.0   3.0     6      7
City7    9.0   3.0     8   8.0 4.500000  5.50   8.0   4.0     5     NA
City8    5.0   6.0     9   8.0 5.000000  4.25   8.0   1.0     4     NA
City9    7.5    NA     6   4.5 3.000000  3.00   8.0   7.5     7     NA
City10    NA    NA    NA    NA       NA    NA    NA    NA    NA      1

The function returns the first column's current value if not NA, NA if there aren't two values to extrapolate from, and the extrapolated value otherwise.

Imputation for bounding NA observations using a linear approximation

2 Answers2

Linked