11

I know how to do a basic polynomial regression in R. However, I only know how to use nls or lm to fit a curve that minimizes the error against the points.

This works most of the time, but sometimes when there are measurement gaps in the data, the model becomes very counter-intuitive. Is there a way to add extra constraints?

Reproducible Example:

I want to fit a model to the following made up data (similar to my real data):

x <- c(0, 6, 21, 41, 49, 63, 166)
y <- c(3.3, 4.2, 4.4, 3.6, 4.1, 6.7, 9.8)
df <- data.frame(x, y)

First, let's plot it.

library(ggplot2)
points <- ggplot(df, aes(x,y)) + geom_point(size=4, col='red')
points

[Plot: the made-up points]

It looks like if we connected these points with a line, it would change direction 3 times, so let's try fitting a quartic to it.

lm <- lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4))
quartic <- function(x)  lm$coefficients[5]*x^4 + lm$coefficients[4]*x^3 + lm$coefficients[3]*x^2 + lm$coefficients[2]*x + lm$coefficients[1]

points + stat_function(fun=quartic)

[Plot: the non-intuitive quartic model]

Looks like the model fits the points pretty well... except, because our data has a large gap between 63 and 166, there is a huge spike there that has no reason to be in the model. (For my actual data, I know there is no huge peak there.)

So the question in this case is:

  • How can I set that local maximum to be at (166, 9.8)?

If that's not possible, then another way to do it would be:

  • How can I keep the y-values predicted by the curve from becoming larger than y = 9.8?

Or perhaps there's a better model to use (other than doing it piece-wise)? My purpose is to compare features of the models between graphs.

Yang Li
  • To get a quartic polynomial fit added to your plot, you can also add this to your `ggplot` code: `geom_smooth(method="lm", se=FALSE, formula=y ~ poly(x,4))`. – eipi10 Dec 09 '15 at 03:45
  • @eipi10 Thanks for the tip! It might not solve the problem but it makes the code a lot cleaner :) – Yang Li Dec 09 '15 at 03:49
  • I'm sure there's a way to create a constrained polynomial fit, but for now, another option is to use local regression. For example: `geom_smooth(colour="red", se=FALSE, method="loess")`. `loess` is the default method when you have small numbers of points, so you can drop the `method` argument if you wish. – eipi10 Dec 09 '15 at 04:08
  • @slickrickulicious The main reason is comparing gradients and local maxima/minima from different graphs. The reason I want to get rid of that huge peak is that my data shows the increase in biomarkers over time, and the last value should be the highest – Yang Li Dec 09 '15 at 06:34
  • Who knows what goes on in that gap?! That is the best quartic polynomial fit, but if you want to find a happy medium between a polynomial fit and a linear fit you could add a few extra data points - evenly spaced, linearly interpolated - and use the `weights` argument to reduce their influence (a rough sketch of this idea appears after the comments). – Gregor Thomas Dec 11 '15 at 02:40
  • You should probably model all your data (from all graphs) with one model, e.g., a GAM or GAMM (a minimal single-series sketch also appears after the comments). It would be preferable by far if you had a model based on the science behind your data. I wouldn't use a polynomial above second degree for modelling data (except when the science says it should follow such a polynomial). – Roland Dec 11 '15 at 12:27
  • Can you provide a larger example dataset? You're fitting a polynomial of degree 4 to 7 datapoints (yielding 2 DF). I suggest a localized method such as splines or loess without more data. – alexwhitworth Dec 14 '15 at 18:20
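
A rough sketch of the weighted-interpolation idea from Gregor Thomas's comment, assuming a 15-unit interpolation step across the 63-166 gap and a weight of 0.1 for the synthetic points (both values, and the names `fill`, `aug` and `fit_w`, are arbitrary choices for illustration, not values from the comment):

x <- c(0, 6, 21, 41, 49, 63, 166)
y <- c(3.3, 4.2, 4.4, 3.6, 4.1, 6.7, 9.8)

## linearly interpolate a few extra points across the gap
fill <- approx(x, y, xout = seq(70, 160, by = 15))

## combine real and synthetic points; down-weight the synthetic ones
aug <- data.frame(x = c(x, fill$x),
                  y = c(y, fill$y),
                  w = c(rep(1, length(x)), rep(0.1, length(fill$x))))

fit_w <- lm(y ~ poly(x, 4), data = aug, weights = w)

plot(aug$x, aug$y, col = ifelse(aug$w == 1, "red", "grey"), pch = 16)
curve(predict(fit_w, newdata = data.frame(x = x)), add = TRUE)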
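
Similarly, a minimal single-series sketch along the lines of Roland's GAM suggestion (he recommends modelling all graphs together with a scientifically motivated model, which is not shown here; the basis dimension `k = 5` and the names `gam_fit` and `newx` are illustrative assumptions, constrained by having only 7 points):

library(mgcv)

x <- c(0, 6, 21, 41, 49, 63, 166)
y <- c(3.3, 4.2, 4.4, 3.6, 4.1, 6.7, 9.8)
df <- data.frame(x, y)

## penalized thin-plate spline smooth
gam_fit <- gam(y ~ s(x, k = 5), data = df)

newx <- seq(0, 166, length.out = 200)
plot(x, y, col = "red", pch = 16)
lines(newx, predict(gam_fit, newdata = data.frame(x = newx)))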

3 Answers

9

A spline will match your data points perfectly (though that is not necessarily good for prediction). Spline curves are widely used in CAD; mathematically they just interpolate the data points and may lack physical meaning compared with a regression. More info is here, and there is a great background introduction here.

Running `example(spline)` will show you lots of fancy examples, and I actually use one of them below.

Furthermore, it may be more reasonable to sample more data points and then fit an lm or nls regression to them for prediction (one possible reading of this is sketched after the spline plot below).

The sample code:

library(splines)

x <- c(0, 6, 21, 41, 49, 63, 166)
y <- c(3.3, 4.2, 4.4, 3.6, 4.1, 6.7, 9.8)

s1 <- splinefun(x, y, method = "monoH.FC")  # monotone Hermite spline (Fritsch-Carlson); no overshoot between adjacent points

plot(x, y)
curve(s1(x), add = TRUE, col = "red", n = 1001)

[Plot: monotone spline interpolation passing through all the data points]
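
One possible reading of the "sample more data points, then regress" suggestion above, using the monotone spline to generate the extra points (the 5-unit sampling step and the names `xx`, `yy`, `dense_fit` are illustrative assumptions):

xx <- seq(0, 166, by = 5)            # denser grid of x values
yy <- s1(xx)                         # sample the monotone spline on that grid
dense_fit <- lm(yy ~ poly(xx, 4))    # refit a quartic to the denser data
lines(xx, predict(dense_fit), col = "blue")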

Another approach I can think of is to constrain the range of the parameters in the regression so that the predicted values stay within your expected range.

A very simple version using optim is below, but it is only one choice.

dat <- data.frame(x = x, y = y)

# your lm 
# lm<-lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4))

# define the loss function (sum of squared residuals of the quartic); you can change it to another one
min.OLS <- function(data, par) {
  with(data, sum((par[1] +
                  par[2] * x +
                  par[3] * x^2 +
                  par[4] * x^3 +
                  par[5] * x^4 - y)^2))
}

# set lower & upper bounds for the regression coefficients
result.opt <- optim(par = c(0, 0, 0, 0, 0),
                    min.OLS,
                    data = dat,
                    lower = c(3.6, -2, -2, -2, -2),
                    upper = c(6, 1, 1, 1, 1),
                    method = "L-BFGS-B")

predict.yy <- function(data, par) {
  with(data, par[1] +
             par[2] * x +
             par[3] * x^2 +
             par[4] * x^3 +
             par[5] * x^4)
}

plot(x, y, main = "LM with constraints")
lines(x, predict.yy(dat, result.opt$par), col = "red")

[Plot: quartic fit with constrained coefficients]

Patric
  • Accepted and +100. While I did not get the exact answer I wanted, the splines solution was the most effective out of all the solutions in the answers. The `method` parameter was particularly useful – Yang Li Dec 17 '15 at 20:33
3

I would go for local regression, as eipi10 suggested. However, if you want polynomial regression, you may try to minimize a penalized sum of squares.

Here is an example where the function is penalized for deviating "too much" from a straight line:

library(ggplot2)
library(maxLik)
x <- c(0, 6, 21, 41, 49, 63, 166)/100
y <- c(3.3, 4.2, 4.4, 3.6, 4.1, 6.7, 9.8)
df <- data.frame(x, y)
points <- ggplot(df, aes(x,y)) + geom_point(size=4, col='red')

polyf <- function(par, x=df$x) {
   ## the polynomial function
   par[1]*x + par[2]*x^2 + par[3]*x^3 + par[4]*x^4 + par[5]
}
quarticP <- function(x) {  ## evaluates the polynomial with the current global 'par'
   polyf(par, x)
}
## an evenly spaced set of points; penalize deviations at these points
grid <- seq(range(df$x)[1], range(df$x)[2], length=10)

objectiveF <- function(par, kappa=0) {
   ## Calculate penalized sum of squares: penalty for deviating from linear
   ## prediction
   PSS <- sum((df$y - polyf(par))^2) + kappa*(pred1 - polyf(par))^2 
   -PSS
}

## first compute linear model prediction
res1 <- lm(y~x, data=df)
pred1 <- predict(res1, newdata=data.frame(x=grid))
points <- points + geom_smooth(method='lm',formula=y~x)
print(points)

## non-penalized function
res <- maxBFGS(objectiveF, start=c(0,0,0,0,0))
par <- coef(res)
points <- points + stat_function(fun=quarticP, col="green")
print(points)

## penalty
res <- maxBFGS(objectiveF, start=c(0,0,0,0,0), kappa=0.5)
par <- coef(res)
points <- points + stat_function(fun=quarticP, col="yellow")
print(points)

The result with penalty 0.5 looks like this: the penalized sum-of-squares line is yellow, the linear regression is blue. You can adjust the penalty, and `grid`, the locations where the deviations are penalized.

Ott Toomet
1

Ott Toomet's source did not work for me; there were some errors. Here is a corrected version (without using ggplot2):

library(maxLik)
x <- c(0, 6, 21, 41, 49, 63, 166)/100
y <- c(3.3, 4.2, 4.4, 3.6, 4.1, 6.7, 9.8)
df <- data.frame(x, y)

polyf <- function(par, x=df$x) {
  ## the polynomial function
  par[1]*x + par[2]*x^2 + par[3]*x^3 + par[4]*x^4 + par[5]
}
quarticP <- function(x) {  ## evaluates the polynomial with the current global 'par'
  polyf(par, x)
}
## an evenly spaced set of points; penalize deviations at these points
grid <- seq(range(df$x)[1], range(df$x)[2], length=10)

objectiveF <- function(par, kappa=0) {
  ## Calculate penalized sum of squares: penalty for deviating from linear
  ## prediction
  PSS <- sum((df$y - polyf(par))^2) + kappa*(pred1 - polyf(par, x=grid))^2 
  -PSS
}

plot(x,y, ylim=c(0,10))

## first compute linear model prediction
res1 <- lm(y~x, data=df)
pred1 <- predict(res1, newdata=data.frame(x=grid))
coefs = coef(res1)
names(coefs) = NULL
constant = coefs[1]
xCoefficient = coefs[2]
par = c(xCoefficient,0,0,0,constant)

curve(quarticP, from=0, to=2, col="black", add=T)


## non-penalized function
res <- maxBFGS(objectiveF, start=c(0,0,0,0,0))
par <- coef(res)
curve(quarticP, from=0, to=2, col="red", add=T)

## penalty
res2 <- maxBFGS(objectiveF, start=c(0,0,0,0,0), kappa=0.5)
par <- coef(res2)
curve(quarticP, from=0, to=2, col="green", add=T)
Fabian Werner