3

I'm having an odd problem with doing polynomial regression in Excel. As many have before, I'm trying to get the correct coefficients that Excel is using when it creates a polynomial trend line on a graph. I've read how to do it using LINEST, and I have been able to get answers that match the trend line formula when doing 2nd- and 3rd-order regression...but when I try 4th or 5th the answers differ wildly from what Excel is showing me on the trend line formula.

Here's the data with the chart and my attempt at the 5th order regression:Click for Excel Workbook

Anyone have any ideas on what might be causing me trouble?

Thanks!

-Jon

Vadaar
  • 43
  • 2
  • 4
  • Just some additional detail I just noticed: when I do a 4th or 5th order regression the 4th and 5th coefficients are zero. If I actually run the formula the prediction is the same as the 3rd order regression (even though the individual coefficients are different). – Vadaar Apr 09 '15 at 13:11
  • Seems to be a problem with your Excel. If I download your file and open it, then the result of the `LINEST` is `-1.06569E-14 1.0973E-09 -3.01287E-05 0 0 210463020.6`. Now I change the value in `B2` to 34 and back to 35. The sheet calculates and shows now: `1.52874E-09 -0.000314866 25.94049994 -1068559.361 22008335073 -1.81315E+14`what is nearly exact what the chart shows. – Axel Richter Apr 09 '15 at 14:36
  • Of course since you are using dates as x values (lowest is 40980), x^5 is really a big number. There the precision limitations of Excel will be touched. – Axel Richter Apr 09 '15 at 14:43
  • If large numbers are the problem (i.e. dates), you can use the `STANDARDIZE` function to normalize the values before regression. – Byron Wall Apr 09 '15 at 15:06
  • Thanks everyone - I figured it out around the same time everyone responded! I tried it in R as well and had the same problem, so it doesn't seem inherent to Excel. Might be a memory addressing issue? I'll leave that to someone else to determine. – Vadaar Apr 09 '15 at 15:07
  • As @AxelRichter says, the problem is using comparatively large serial date numbers for X values combined with the limited precision double floating point number format that Excel uses internally. See below for a formula to work around this limitation. – lori_m Apr 18 '15 at 09:18

2 Answers2

2

The zero values in the output are a consequence of (multi-)collinearity. From MS function help:

"...The LINEST function checks for collinearity and removes any redundant X columns from the regression model when it identifies them. Removed X columns can be recognized in LINEST output as having 0 coefficients in addition to 0 se values...."

For more accurate estimates apply LINEST with mean-centered x-values and post-multiply by the matrix of binomial coefficients. So in place of:

=LINEST(B2:B31,A2:A31^{1,2,3,4,5})

try instead:

=MMULT(LINEST(B2:B31,(A2:A31-AVERAGE(A2:A31))^{1,2,3,4,5}),IFERROR(COMBIN({5;4;3;2;1;0},{5,4,3,2,1,0})*(-AVERAGE(A2:A31))^({5;4;3;2;1;0}-{5,4,3,2,1,0}),0))

which agree with the trendline values.

Also see: https://newtonexcelbach.wordpress.com/2011/02/04/fitting-high-order-polynomials/ (post and comments)

lori_m
  • 5,487
  • 1
  • 18
  • 29
0

You may be overfitting; in fact, I'd say you probably are if you're using a 5th order polynomial. With polynomials higher than order 3 (even higher than order 1 sometimes!) you are taking a risk that your "trend line" will shoot off into outer space at the end points. When you get past a certain level of complexity you are fitting the model to random variations in the data points rather than the underlying mechanism that generated the data.

One way to get avoid the problem is to use a regularization mechanism like ridge regression. Christopher Bishop's book Pattern Recognition and Machine Learning has an excellent discussion on the topic and gives a good example of how to regularize a polynomial regression.

BTW, I wouldn't use Excel for statistical analysis. Here is an article on Excel's statistical capabilities from Computational Statistics and Data Analysis journal (I first saw a link to this on another stackoverflow post; but I don't remember where; apologies to the original poster for not giving appropriate credit): http://www.pages.drexel.edu/~bdm25/excel2007.pdf

Taiko
  • 19
  • 3
  • This looks quite useful indeed, but it should have been a comment, not an answer, since it only addresses whether to attempt this in the first place without solving the immediate issue. With a bit more rep, [you will be able to post comments](http://stackoverflow.com/privileges/comment). – Nathan Tuggy Jun 03 '15 at 00:22