I have a kernel smoother like so:

x <- 1:100
y <- rnorm(100, mean=(x/2000)^2)
plot(x,y)
kernel <- ksmooth(x,y, kernel="normal", bandwidth=10)
print(kernel$y)

If I try to predict at a point outside the range of the x values, it gives me NA, because that would mean extrapolating beyond the data:

x <- 1:100
y <- rnorm(100, mean=(x/2000)^2)
plot(x,y)
kernel <- ksmooth(x,y, kernel="normal", bandwidth=10, x.points=c(130))
print(kernel$y)

> print(kernel$y)
[1] NA

Even when I widen range.x, the result is still NA:

x <- 1:100
y <- rnorm(100, mean=(x/2000)^2)
plot(x,y)
kernel <- ksmooth(x,y, kernel="normal", bandwidth=10, range.x=c(1,200) , x.points=c(130))
print(kernel$y)

> print(kernel$y)
[1] NA

How do I get the ksmooth function to extrapolate beyond the data? I know this is a bad idea in theory, but in practice this issue comes up all the time.

makansij
  • I suppose a side-question would be: *What does `range.x` do?* The documentation seems to indicate that it is "the range of points to be covered in the output." But it doesn't seem to have any effect on the output here. – makansij Apr 25 '16 at 07:03
  • Have you considered using bkde2D {KernSmooth} instead? By default this extrapolates 1.5x the bandwidth in each direction. Extrapolate further at your peril, and be sure you understand what you are doing and its limitations. If you really need to extrapolate, then perhaps you should consider fitting the data to a model that has some relationship to reality and forms a more reasonable basis for extrapolation. – dww Apr 30 '16 at 12:04
  • I'll give that a try. The issue is that it is really difficult to run k-fold cross-validation when there is no extrapolation, because any time I split the data into train and test for the kth fold, some of the test examples are inevitably outside the range of the training examples. Am I the only one frustrated by this? – makansij Apr 30 '16 at 22:57
  • Let us know if this works better for you. Note that the documentation for ksmooth even states "This function is implemented purely for compatibility with S, although it is nowhere near as slow as the S function. Better kernel smoothers are available in other packages." – dww May 01 '16 at 14:52

1 Answer

To answer your side question: looking at the code of ksmooth, range.x is only used when x.points is not provided, which explains why it has no effect in your example. Here is the code of ksmooth:

function (x, y, kernel = c("box", "normal"), bandwidth = 0.5, 
    range.x = range(x), n.points = max(100L, length(x)), x.points) 
{
    if (missing(y) || is.null(y)) 
        stop("numeric y must be supplied.\nFor density estimation use density()")
    kernel <- match.arg(kernel)
    krn <- switch(kernel, box = 1L, normal = 2L)
    x.points <- if (missing(x.points)) 
        seq.int(range.x[1L], range.x[2L], length.out = n.points)
    else {
        n.points <- length(x.points)
        sort(x.points)
    }
    ord <- order(x)
    .Call(C_ksmooth, x[ord], y[ord], x.points, krn, bandwidth)
}
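
As a quick check of that reading of the source, here is a minimal sketch. The seed and the evaluation points 50 and 110 are my own choices for illustration. With x.points supplied, changing range.x should leave the result untouched; without x.points, the evaluation grid should span range.x.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible
x <- 1:100
y <- rnorm(100, mean = (x / 2000)^2)

# x.points supplied: range.x is never consulted, so both calls should agree
a <- ksmooth(x, y, kernel = "normal", bandwidth = 10, x.points = c(50, 110))
b <- ksmooth(x, y, kernel = "normal", bandwidth = 10, range.x = c(1, 200),
             x.points = c(50, 110))
same_with_x_points <- identical(a$y, b$y)
print(same_with_x_points)

# x.points omitted: the evaluation grid is built from range.x
g <- ksmooth(x, y, kernel = "normal", bandwidth = 10, range.x = c(1, 200))
print(range(g$x))
```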

From this we see that range.x is only used when x.points is omitted. If you run:

x <- 1:100
y <- rnorm(100, mean=(x/2000)^2)
plot(x,y)
kernel <- ksmooth(x,y, kernel="normal", bandwidth=10, range.x=c(1,200))
plot(kernel$x, kernel$y)

Now you'll see that the smoother is evaluated beyond 100 (although not all the way up to 200). Increasing the bandwidth parameter lets the estimate reach even further beyond 100.
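
As far as I can tell from the C source behind ksmooth, the curve stops short of 200 because the normal kernel is truncated: the bandwidth is rescaled by 0.3706506 and weights are only computed within 4 times that, i.e. roughly 1.48 * bandwidth on either side of the evaluation point. A point with no data inside that window gets NA. Here is a minimal sketch under that assumption; the helper `reach`, the seed, and the bandwidth of 30 are my own choices for illustration, not part of ksmooth.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible
x <- 1:100
y <- rnorm(100, mean = (x / 2000)^2)

# Hypothetical helper: assumed half-width of the truncated normal kernel,
# based on my reading of the C source (not a documented quantity)
reach <- function(bandwidth) 4 * 0.3706506 * bandwidth

# bandwidth = 10: the window is about 14.8 wide, so 110 still sees data
# (x up to 100) but 130 does not
narrow <- ksmooth(x, y, kernel = "normal", bandwidth = 10,
                  x.points = c(110, 130))
print(reach(10))
print(narrow$y)

# bandwidth = 30: the window is about 44.5 wide, so 130 now sees data
wide <- ksmooth(x, y, kernel = "normal", bandwidth = 30, x.points = 130)
print(reach(30))
print(wide$y)
```

Keep in mind that a value obtained this way is just a weighted average of the observations nearest the edge, so it will not follow any trend in the data.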

Clecocel