0

I'm confused by the coefficients produced by the output of lm

Here's a copy of the data I'm working with

(postprocessed.csv)

"","time","value"
"1",1,2.61066016308988
"2",2,3.41246054742996
"3",3,3.8608767964033
"4",4,4.28686048552237
"5",5,4.4923132964825
"6",6,4.50557049744317
"7",7,4.50944447661246
"8",8,4.51097373134893
"9",9,4.48788748823809
"10",10,4.34603985656981
"11",11,4.28677073671406
"12",12,4.20065901625172
"13",13,4.02514194962519
"14",14,3.91360194972916
"15",15,3.85865748409081
"16",16,3.81318053258601
"17",17,3.70380706527433
"18",18,3.61552922363713
"19",19,3.61405310598722
"20",20,3.64591327503384
"21",21,3.70234435835577
"22",22,3.73503970503372
"23",23,3.81003078640584
"24",24,3.88201196162666
"25",25,3.89872518158949
"26",26,3.97432743542362
"27",27,4.2523675144599
"28",28,4.34654855854847
"29",29,4.49276038902684
"30",30,4.67830892029687
"31",31,4.91896819673664
"32",32,5.04350767355202
"33",33,5.09073406942046
"34",34,5.18510849382162
"35",35,5.18353176529036
"36",36,5.2210776270173
"37",37,5.22643491929207
"38",38,5.11137006553725
"39",39,5.01052467981257
"40",40,5.0361056705898
"41",41,5.18149486951409
"42",42,5.36334869132276
"43",43,5.43053620818444
"44",44,5.60001072279525

I have fitted a 4th order polynomial to this data using the following script:

library(ggplot2)
library(matrixStats)
library(forecast)

df_input <- read.csv("postprocessed.csv")

x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)

poly4model <- lm(y~poly(x, degree=4), data=df)

v <- seq(30, 40)
vv <- poly4model$coefficients[1] +
  poly4model$coefficients[2] * v +
  poly4model$coefficients[3] * (v ^ 2) +
  poly4model$coefficients[4] * (v ^ 3) +
  poly4model$coefficients[5] * (v ^ 4)

pdf("postprocessed.pdf")
plot(df)
lines(v, vv, col="red", pch=20, lw=3)
dev.off()

I initially tried using the predict function to do this, but couldn't get that to work, so resorted to implementing this "workaround" using some new vectors v and vv to store the data for the line in the region I am trying to plot.

Ultimatly, I am trying to do this:

  • Fit a 4th order polynomial to the data
  • Plot the 4th order polynomial over the range of data in one color
  • Plot the 4th order polynomial over the range from the last value to the last value + 10 (prediction) in a different color

At the moment I am fairly sure using v and vv to do this is not "the best way", however I would have thought it should work. What is happening is that I get very large values.

Here is a screenshot from Desmos. I copied and pasted the same coefficients as shown by typing poly4model$coefficients into the console. However, something must have gone wrong because this function is nothing like the data.

I think I've provided enough info to be able to run this short script. However I will add the pdf as well.

desmos

pdf

Dave2e
  • 22,192
  • 18
  • 42
  • 50
FreelanceConsultant
  • 13,167
  • 27
  • 115
  • 225

1 Answers1

2

It is easiest to use the predict function to create your line. To do that, you pass the model and a data frame with the desired independent variables to the predict function.

x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)

poly4model <- lm(y~poly(x, degree=4), data=df)

v <- seq(30, 40)
#Notice the column in the dataframe is the same variable name 
#     as the variable in the model!
predict(poly4model, data.frame(x=v))

plot(df)
lines(v, predict(poly4model, data.frame(x=seq(30, 40))), col="red", pch=20, lw=3)

enter image description here

NOTE
The function poly "Returns or evaluates orthogonal polynomials of degree 1 to degree over the specified set of points x: these are all orthogonal to the constant polynomial of degree 0." To return the "normal" polynomial coefficients one needs to use the "raw=TRUE" option in the function.

poly4model <- lm(y~poly(x, degree=4, raw=TRUE), data=df)

Now your equation above will work.

Dave2e
  • 22,192
  • 18
  • 42
  • 50
  • Thanks for this... Couple of questions: I see that `seq` was still required to generate a sequence which is stored in `v`. Presumably there is no way to get rid of that extra step? I was pretty sure `vv` would not be required, but I thought perhaps an additional vector `v` was uneccessary also. Second question: Is the call to `predict` required twice? It seems like it is called inside the `lines` function as well as by itself. Finally: I didn't understand the note which has been added at the end: Why is `raw=TRUE` required? (Or is it required?) Thanks for the help thus far! – FreelanceConsultant Mar 13 '21 at 15:19
  • An additional question actually, sorry to overwhelm with the many questions: If I want to predict a range outside of the x axis range, is that possible? For example, can I plot from 30 to 60 somehow? – FreelanceConsultant Mar 13 '21 at 15:20
  • There is a lot to unpack. Yes, define the dataframe with the `seq` function to avoid defining "v". The first call `predict` was for demonstration and I could have removed it. Concerning `raw=TRUE` please see the poly help page. The poly function can generate the coefficients for either the orthogonal polynomials or standard polynomials (what most people are use to). This is deeper math than I really care to know. For second comment, Yes you can predict for any value, but should you? I wouldn't predict beyond the limits of the data for anything greater than a 2nd order – Dave2e Mar 13 '21 at 15:48
  • 1
    A fair comment - the purpose of predicting outside of the range is to demonstrate that polynomials are generally very bad at predicting beyond the constraining range of data. (Because they blow up) – FreelanceConsultant Mar 13 '21 at 15:52