
I have a matrix of values and zeros, where zero = NA. The values are interspersed around the matrix, and what I want to do is interpolate values for all of the NA cells. This is the data:

[Image: surface plot of the raw data]

I'm trying to guess all of these values by taking every known value in my matrix and weighting it by the inverse of its distance (such that the further away a point is, the less influence it has). This is what the interpolated result looks like: [Image: surface plot of the interpolated result]

As you can see, this method is not very effective: it does affect the NAs nearest to the known values, but they quickly converge onto an average value. I think this is because each point is weighted against the ENTIRE range of known values, which has many ups and downs, rather than against just the points nearest to it.

Obviously, matrix operations aren't my specialty... what do I need to change to do the linear interpolation correctly?

Here's the code:

library(dplyr)
library(plotly)

Cont <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 1816, 2320, 1406, 2028, 1760, 1932, 1630, 
                    1835, 1873, 1474, 1671, 2073, 1347, 2131, 2038, 1969, 2036, 1602, 
                    1986, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 2311, 1947, 2094, 1947, 2441, 1775, 1461, 1260, 
                    1494, 2022, 1863, 1587, 2082, 1567, 1770, 2065, 1404, 1809, 1972, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                    0, 0, 0, 2314, 1595, 2065, 1870, 2178, 1410, 1994, 1979, 2111, 
                    1531, 1917, 1559, 2109, 1921, 1606, 1469, 1601, 1771, 1771),
                  .Dim = c(19L, 30L))

## First get real control values
idx <- which(Cont > 0, arr.ind = TRUE)
V <- Cont[idx]
ControlValues <- data.frame(idx, V)

## Make data.frame of values to fill
toFill <- which(Cont == 0, arr.ind = TRUE) %>% as.data.frame
toFill$V <- 0

## And now figure out the weighted value of each point
for (i in 1:nrow(toFill)) {
  CurrentPoint <- toFill[i, ]

  ## inverse-distance weights along each axis; infinite weights
  ## (zero distance) are dropped before normalizing
  Xs <- 1 / abs(CurrentPoint[, 1] - ControlValues[, 1])
  Xs[is.infinite(Xs)] <- 0
  Xs <- Xs / sum(Xs) / 100

  Ys <- 1 / abs(CurrentPoint[, 2] - ControlValues[, 2])
  Ys[is.infinite(Ys)] <- 0
  Ys <- Ys / sum(Ys) / 100

  ControlValues1 <- data.frame(Xs, Ys)
  toFill[i, 3] <- sum(rowMeans(ControlValues1) * ControlValues$V) * 100
}

## add back in the controls and reorder
Both <- bind_rows(ControlValues, toFill)
Both <- Both %>% arrange(row, col)

## and plot the new surface
NewCont <- matrix(Both$V, max(Both$row), max(Both$col), byrow = TRUE)
plot_ly(z = NewCont, type = "surface", showscale = FALSE)
  • You won't be able to *interpolate* the values for `x < 10` because your data has no support there. If you are only interested in the interpolated values for the range `10 <= x <= 30`, you can use bi-linear interpolation. – aichao Jul 25 '16 at 16:31
  • A fair point. I'd like to interpolate AND extrapolate. Isn't what I did bi-linear interpolation? – Amit Kohli Jul 25 '16 at 16:54
  • I don't think your code does bi-linear interpolation. Also, just fair warning that extrapolation in your case will have little value since your data is so sparse. – aichao Jul 25 '16 at 17:50
  • I'd really appreciate to see a working example, as I mentioned, matrix operations aren't my strong point. Thanks! – Amit Kohli Jul 25 '16 at 17:57

1 Answer

One approach to interpolating and extrapolating data in R is to use the akima package. The following performs bi-linear interpolation followed by extrapolation, using the known data points in the data frame ControlValues as input to fill the zeroes in Cont.

library(akima)
library(plotly)

NewCont <- akima::interp(x=ControlValues[,1], y=ControlValues[,2], z=ControlValues[,3],
                         xo=1:nrow(Cont), yo=1:ncol(Cont), linear=TRUE)$z
NewCont[,1:9] <- akima::interp.old(x=ControlValues[,1], y=ControlValues[,2], 
                                   z=ControlValues[,3], xo=1:nrow(Cont), 
                                   yo=1:9, ncp=2, extrap=TRUE)$z

plot_ly(z=NewCont, type="surface",showscale=FALSE)

Notes:

  1. The first call to akima::interp performs the bi-linear interpolation. See the help page ?akima::interp for usage and details.

    • A key point is that the inputs x, y, and z for the known data points need not be on an x-y grid. In this case, these are the columns of ControlValues.
    • The output of akima::interp is a list whose z component is a matrix of interpolated values over the grid whose x and y coordinates are defined by the inputs xo and yo, respectively. In this case, these are just the row and column indices of Cont.
    • As stated in the help page

    z-values for points outside the convex hull are returned as NA.

    In this case, the first nine columns of the output, corresponding to yo=1:9, will be NAs (a quick check is sketched after this list).

  2. The second call to akima::interp (actually akima::interp.old) performs the data extrapolation to fill in the NAs left by the first call. See this SO question/answer for the details of this usage.
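
To see note 1's point concretely, here is a quick sketch (assuming Cont and ControlValues from the question; NAsOnly is just an illustrative name): re-run the first interp call alone, before the interp.old fill, and locate the all-NA columns.

library(akima)

## interpolation only -- columns outside the convex hull of the known
## points come back as NA
NAsOnly <- akima::interp(x=ControlValues[,1], y=ControlValues[,2],
                         z=ControlValues[,3], xo=1:nrow(Cont),
                         yo=1:ncol(Cont), linear=TRUE)$z
which(apply(is.na(NAsOnly), 2, all))
## for this data: columns 1 through 9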

The above approach gives the following result:

[Image: surface plot of the interpolated and extrapolated NewCont]

Another approach to performing bi-linear interpolation is to use the interp.surface function in the fields package. This approach is mentioned because the implementation is an R script, which can be listed by typing the function name interp.surface at the R command line.

library(fields)

loc <- make.surface.grid(list(x=1:nrow(Cont), y=1:ncol(Cont)))
NewCont2 <- matrix(interp.surface(list(x=sort(unique(ControlValues[,1])),
                                       y=sort(unique(ControlValues[,2])),
                                       z=matrix(ControlValues[,3],
                                                nrow=length(unique(ControlValues[,1])),
                                                ncol=length(unique(ControlValues[,2])))),
                                  loc), nrow=nrow(Cont), ncol=ncol(Cont))
NewCont2[,1:9] <- akima::interp.old(x=ControlValues[,1], y=ControlValues[,2], 
                                    z=ControlValues[,3], xo=1:nrow(Cont), 
                                    yo=1:9, ncp=2, extrap=TRUE)$z

Here, the requirements are the opposite of those for akima::interp. Specifically, the known data points must lie on an x-y grid. However, the coordinates to interpolate need not be on a grid; instead, they are supplied as a matrix with corresponding x and y column vectors, where each row (x[i],y[i]) is an x-y coordinate to interpolate. Since the data points in ControlValues happen to lie on a grid, these requirements are also satisfied in this case. See the help page ?interp.surface for usage and details. A toy sketch of this input convention follows.
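
To make the convention concrete, here is a minimal sketch with hypothetical 2x2 toy data (independent of the question's matrix):

library(fields)

## known z-values on a 2x2 grid: z[i,j] is the value at (x[i], y[j])
obj <- list(x = c(1, 2), y = c(1, 2),
            z = matrix(c(0, 1,   # column 1: z at (1,1), (2,1)
                         1, 2),  # column 2: z at (1,2), (2,2)
                       nrow = 2))

## coordinates to interpolate: one (x, y) pair per row; (1.5, 1.5) is
## off-grid but inside the grid, while (3, 3) lies outside and returns NA
loc <- rbind(c(1.5, 1.5), c(3, 3))
interp.surface(obj, loc)
## [1]  1 NA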

Notes:

  1. sort(unique(ControlValues[,1])) and sort(unique(ControlValues[,2])) simply give the x and y coordinates for the grid of known data points.
  2. The z component in the list is simply the z-values for the known data points, reshaped as a matrix over the grid of known data points.
  3. The matrix of coordinates to interpolate is generated by make.surface.grid, using as x and y coordinates the row and column indices of Cont, respectively.
  4. A coordinate to interpolate that lies outside the grid of known points will result in an interpolated value of NA.
  5. interp.surface returns a vector of z-values corresponding to the coordinates to interpolate. This is then reshaped to a matrix over the grid of coordinates to interpolate, which has dimensions nrow(Cont) by ncol(Cont).

Finally, it is easy to verify that the two approaches give the same result:

print(max(abs(NewCont - NewCont2)))
##[1] 4.547474e-13
  • Excellent answer. I was even going to ask about how to compare the interpolation methods, but it's great that you even showed that at the end. Wonderful! Incredible that the two methods are so close... I guess the interpolation algorithms are very similar? The other thing I'm noticing, which is a bit annoying, is that the peaks and valleys are too extreme... If I'm interested in building a baseline control surface, maybe I should apply some smoothing first. Is there some correct way to do it? Or is it fair to just multiply all my values by `0.8` or so? Thanks again! – Amit Kohli Jul 26 '16 at 10:30
  • @AmitKohli sorry for the delay in getting back to you. For your first question, the short answer is that the algorithm in `akima` is more sophisticated because it does not require a grid of known points to interpolate. The algorithm in `fields`, however, is what people would normally refer to as bi-linear interpolation. This requires a grid of known points, which your data happens to satisfy. For the answer to your second question, see my next comment. – aichao Jul 28 '16 at 15:32
  • @AmitKohli the correct way all depends on the meaning of your data. If your data is known to be perfectly accurate, then you should interpolate (and can extrapolate). For additional smoothness, you may want to interpolate using a higher order function such as cubic splines. If your data is known to be noisy, then you want to fit some function to these data points that minimizes some error criterion. This is estimation. Here, the function can be linear or higher-order. The difference is that fitting will generally not preserve the values of the data represented by the estimated function. – aichao Jul 28 '16 at 15:41
  • this is exactly the kind of help I'm looking for. I can tell you a little bit more about this project... I'm trying to help my dad (a wheat breeder) use a better way of interpreting experimental wheat yields relative to control varieties that are interspersed throughout the field. Would you like to collaborate? I don't know if you're involved in development, but this could be a really useful tool for people who really need it! I would be very interested to hear your perspective looking at the real data. Interested in doing your part? :) https://github.com/mexindian/lineSelector – Amit Kohli Jul 29 '16 at 11:15
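
For readers who want to try the cubic-spline route suggested in the comments above, here is a hedged sketch (assuming Cont and ControlValues from the question; SplineCont is just an illustrative name, and extrapolation beyond the convex hull is not handled here):

library(akima)
library(plotly)

## same inputs as the answer's first call, but linear=FALSE requests
## Akima's spline interpolation instead of bi-linear interpolation
SplineCont <- akima::interp(x=ControlValues[,1], y=ControlValues[,2],
                            z=ControlValues[,3], xo=1:nrow(Cont),
                            yo=1:ncol(Cont), linear=FALSE)$z

plot_ly(z=SplineCont, type="surface", showscale=FALSE)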