4

I have a vector containing NAs, some of them at the boundaries:

x <- c(NA, -1, 1,-1, 1, NA, -1, 2, NA, NA)

I want the outcome to be:

c(-3, -1, 1,-1, 1, 0, -1, 2, 5, 8)

In other words, I want to fill both inner and boundary NA's with linear interpolation (maybe I cannot call it "inter-polation" since NA's are at boundaries).

I tried na.fill(x, "extend") from the "zoo" package, but the boundary output is not what I want: it just repeats the leftmost or rightmost non-NA value:

na.fill(x,"extend")

and the output is

[1] -1 -1  1 -1  1  0 -1  2  2  2

I also checked other functions for filling NAs, such as na.approx(), na.locf(), etc., but none of them work.

na.spline() does fill the boundary NAs, but the values it produces there vary wildly:

na.spline(x)

The output is:

 [1] -15.9475983  -1.0000000   1.0000000  -1.0000000   1.0000000   0.3400655  -1.0000000   2.0000000
 [9]  13.1441048  35.9323144

The boundary points are too large. Can anyone help me out? Thanks in advance!

Hongfei Li
  • Seems related: [Imputation for bounding NA observations using a linear approximation](https://stackoverflow.com/questions/30167674/imputation-for-bounding-na-observations-using-a-linear-approximation) – Henrik Nov 10 '19 at 17:54
  • I doubt there will be an existing answer that delivers your expectation. It appears that you want the means of flanking non-missing values for interior NA's and the differences of two adjacent non-missing values to be used to extend the NA positions at ends. This is not really a standard procedure, so I think you would need to specify the rules more carefully if you want an on-target response ... that is if my guess is correct about the rules to be used. – IRTFM Nov 10 '19 at 17:56
  • You are right. What I want is to fill NAs at both interior and exterior positions (i.e., boundary points). 1) For interior points, it is reasonable to use linear interpolation, which can be found in several packages. 2) For the exterior points, however, I failed to find a standard solution, so I suggest a method similar to "linear interpolation" that uses the nearest two non-NA data points. This method is admittedly not very well justified. – Hongfei Li Nov 10 '19 at 22:21
  • ...see also the link above: "_I would like to extrapolate the boundaries [..] using a linear approximation based on the two preceding/following observations_", with an answer by the author of `zoo`. – Henrik Nov 10 '19 at 22:40

4 Answers

6

You can use na.spline() from the zoo library:

na.spline(x)

[1] 0.0 0.5 1.0 1.5 2.0 2.5

Data for the original question:

x <- c(0, NA, 1, NA, 2, NA)
tmfmnk
2

Given the data and expected output after the question's edit, I believe the following function does it. It fills in the interior NAs with approxfun and then extrapolates each end in turn, using the slope of the two nearest non-NA values.

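na.extrapol <- function(y){
  x <- seq_along(y)
  # linear interpolation of the interior NA's; boundary NA's remain NA
  f <- approxfun(x[!is.na(y)], y[!is.na(y)])
  y[is.na(y)] <- f(x[is.na(y)])
  # runs of NA / non-NA that are left after the interior fill
  r <- rle(is.na(y))
  if(r$values[1]){
    # leading NA's: step backwards using the first two non-NA values
    Y <- y[r$lengths[1] + 1:2]
    X <- seq_len(r$lengths[1])
    y[rev(X)] <- Y[1] - diff(Y)*X
  }
  n <- length(r$lengths)
  if(r$values[n]){
    # trailing NA's: step forwards using the last two non-NA values
    s <- sum(r$lengths[-n])
    Y <- y[s - 1:0]
    X <- seq_len(r$lengths[n])
    y[s + X] <- Y[2] + diff(Y)*X
  }
  y
}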

x <- c(NA, -1, 1,-1, 1, NA, -1, 2, NA, NA)
na.extrapol(x)
#[1] -3 -1  1 -1  1  0 -1  2  5  8

x2 <- c(NA, NA, -1, 1,-1, 1, NA, -1, 2, NA, NA)
na.extrapol(x2)
#[1] -5 -3 -1  1 -1  1  0 -1  2  5  8
Rui Barradas
2

Here is one way to do it:

First, we do a linear approximation of the interior NAs, which leaves only the leading and trailing NAs:

x <- na.approx(x, na.rm = FALSE)

Now let's take the vector of non-NA values and compute the step (common difference) of the arithmetic progression at the left and right ends:

x_c <- x[!is.na(x)]
left <- x_c[1] - x_c[2]
right <- x_c[length(x_c)] - x_c[length(x_c) - 1]

Now it's time to fill the leading and trailing NAs with the values of those arithmetic progressions:

ind_x<- which(!is.na(x))
big_M <- 100

x[(ind_x[length(ind_x)]):length(x)] <- seq(x[ind_x[length(ind_x)]],
                                           sign(right) * big_M,
                                           right)[1:(length(x) - ind_x[length(ind_x)] + 1)]
x[ind_x[1]:1] <- seq(x[ind_x[1]],sign(left) * big_M,left)[1:ind_x[1]]
y <- x

where big_M is a user-defined bound, chosen large enough in magnitude that the arithmetic progression never reaches it for the data at hand.

Input - Output:

x <- c(NA, -1, 1,-1, 1, NA, -1, 2, NA, NA)
> y
 [1] -3 -1  1 -1  1  0 -1  2  5  8

x <- c(NA,NA,NA, -1, 1,-1, 1, NA, -1, 2, NA, NA,NA)
> y
 [1] -7 -5 -3 -1  1 -1  1  0 -1  2  5  8 11

x <- c(NA,NA,NA, 5,1, 1,-1, 1, NA, -1, 2, NA, NA,NA)
> y
 [1] 17 13  9  5  1  1 -1  1  0 -1  2  5  8 11
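
The three steps above can also be wrapped in one function. The following is only a minimal sketch, not the answer's own code: the name fill_linear is hypothetical, it uses base R's approx() instead of zoo's na.approx(), and it assumes at least two non-NA values. Because each tail is built from a step and a count rather than with seq() up to a bound, no big_M is needed, and a zero step at either end is handled as well.

```r
# Hypothetical helper (not from the answer): fill interior NAs linearly,
# then continue the edge slopes outward as arithmetic progressions.
fill_linear <- function(x) {
  idx <- which(!is.na(x))
  stopifnot(length(idx) >= 2)   # two known points are needed for a slope
  first <- idx[1]
  last  <- idx[length(idx)]

  # Interior gaps: base R linear interpolation over the known points.
  x[first:last] <- approx(idx, x[idx], xout = first:last)$y

  # Leading NAs: step backwards from the first known value.
  n_left <- first - 1
  if (n_left > 0) {
    step <- x[first] - x[first + 1]
    x[n_left:1] <- x[first] + step * seq_len(n_left)
  }

  # Trailing NAs: step forwards from the last known value.
  n_right <- length(x) - last
  if (n_right > 0) {
    step <- x[last] - x[last - 1]
    x[last + seq_len(n_right)] <- x[last] + step * seq_len(n_right)
  }
  x
}

fill_linear(c(NA, -1, 1, -1, 1, NA, -1, 2, NA, NA))
#  [1] -3 -1  1 -1  1  0 -1  2  5  8
```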
Vitali Avagyan
0

Besides considering Hmisc::approxExtrap, another option is to use lm, although it will most likely be slower than the other options here:

x <- c(NA, -1, 1,-1, 1, NA, -1, 2, NA, NA)
DF <- data.frame(i=seq_along(x), x)
cc <- DF[complete.cases(DF),]
DF$x <- approx(cc$i, cc$x, DF$i)$y
hh <- head(cc, 2L)
tt <- tail(cc, 2L)
DF$x[DF$i < hh$i[1L]] <- predict(lm(x ~ i, hh), DF[DF$i < hh$i[1L], "i", drop=FALSE])
DF$x[DF$i > tt$i[2L]] <- predict(lm(x ~ i, tt), DF[DF$i > tt$i[2L], "i", drop=FALSE])
DF

output:

    i  x
1   1 -3
2   2 -1
3   3  1
4   4 -1
5   5  1
6   6  0
7   7 -1
8   8  2
9   9  5
10 10  8
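
For comparison, the Hmisc::approxExtrap route mentioned at the top of this answer might look like the sketch below. The wrapper name fill_na_extrap is hypothetical, and the sketch assumes the Hmisc package is installed and that the vector has at least two non-NA values. approxExtrap() interpolates linearly inside the known range and extends the first and last segments beyond it, so it should reproduce the expected output from the question.

```r
# Hypothetical wrapper name; Hmisc::approxExtrap() is the function the answer mentions.
fill_na_extrap <- function(y) {
  i  <- seq_along(y)
  ok <- !is.na(y)
  # Interpolate inside the known range and extend the end segments linearly.
  Hmisc::approxExtrap(i[ok], y[ok], xout = i)$y
}

# Usage (requires the Hmisc package):
# fill_na_extrap(c(NA, -1, 1, -1, 1, NA, -1, 2, NA, NA))
```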
chinsoon12