
As per the below code for the polynomial regression coefficients, when I calculate the regression value at any x point, the value obtained is far away from the corresponding y coordinate (especially for the coordinates below). Can anyone explain why the difference is so high, whether it can be minimized, or whether there is a flaw in my understanding? The current requirement is a difference of no more than 150 at every point.


import numpy as np
x = [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
y = [0,885,3517,5935,8137,11897,10125,13455,14797,15925,16837,17535,18017,18285,18328,18914,19432,19879,20249,20539,20746]
z = np.polyfit(x, y, 3)
print(z)

I have also tried various codes available in Java, but the coefficient values are the same everywhere for this data. Please help me with understanding this.
For example, with these coefficients the polynomial is

0.019168 * N^3 - 5.540901 * N^2 + 579.846493 * N - 1119.339450

N = 5:   value = 1643.76649,  Y value = 885
N = 10:  value = 4144.20338,  Y value = 3517
N = 100: value = 20624.29985, Y value = 20746
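A minimal sketch of how these values can be reproduced in Python: np.polyval evaluates the coefficient array returned by np.polyfit (highest degree first) at each x, so the per-point error is just the difference from y.

import numpy as np

x = [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
y = [0,885,3517,5935,8137,11897,10125,13455,14797,15925,16837,17535,18017,18285,18328,18914,19432,19879,20249,20539,20746]

z = np.polyfit(x, y, 3)          # cubic coefficients, highest degree first
fitted = np.polyval(z, x)        # evaluate the cubic at every x
for xi, yi, fi in zip(x, y, fitted):
    print(xi, yi, round(fi, 5), round(fi - yi, 5))  # x, observed y, fitted value, error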
neer
  • Hi neer, welcome to StackOverflow! It would be nice if you shared what else you have tried (if anything) or what your hypotheses on the problem are. It does seem that a polynomial regression of degree 100 would match your residual target; with a higher degree you would probably match the target for all x and y. Additionally, since the minimization problem involves no stochastic computations, the results will be the same whatever software you use. – Nicg Jan 10 '20 at 08:09
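To illustrate the point about higher degrees from the comment above, a small sketch (assuming the same x and y as in the question) that compares the summed squared residual reported by np.polyfit with full=True for a few degrees; the residual shrinks as the degree grows, and with 21 points a degree-20 polynomial would pass through every point exactly.

import numpy as np

x = [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
y = [0,885,3517,5935,8137,11897,10125,13455,14797,15925,16837,17535,18017,18285,18328,18914,19432,19879,20249,20539,20746]

for deg in (3, 4, 5, 6, 7, 8):
    coeffs, residuals, rank, sv, rcond = np.polyfit(x, y, deg, full=True)
    print(deg, residuals)  # sum of squared fit errors, decreasing with degree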

2 Answers


The polynomial fit performs as expected. There is no error here, just a large deviation in your data. You might want to rescale your data, though. If you add the parameter full=True to np.polyfit, you will receive additional information, including the residuals, which is essentially the sum of the squared fit errors. See this other SO post for more details.

import matplotlib.pyplot as plt
import numpy as np

x = [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
y = [0,885,3517,5935,8137,11897,10125,13455,14797,15925,16837,17535,18017,18285,18328,18914,19432,19879,20249,20539,20746]

m = max(y)
y = [p/m for p in y] # rescaled y such that max(y)=1, and dimensionless

z, residuals, rank, sing_vals, cond_thres = np.polyfit(x,y,3,full=True)

print("Z: ",z) # [ 9.23914285e-07 -2.67082878e-04  2.79497972e-02 -5.39544708e-02]

print("resi:", residuals) # 0.02188 : quite decent, depending on WHAT you're measuring ..

Z = [z[3] + q*z[2] + q*q*z[1] + q*q*q*z[0] for q in x]  # evaluate the fitted cubic at each x (coefficients are highest degree first)

fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(x,y)
ax.plot(x,Z,'r')
plt.show()
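As a side note, the manual list comprehension for Z above can also be written with np.polyval, which evaluates the coefficient array (highest degree first) directly:

Z = np.polyval(z, x)  # same values as the list comprehension above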

[Plot: rescaled data points with the fitted third-degree polynomial curve]

Wololo
  • Can you run the code below? It gives the following values, which are far away from the coordinates. My question is whether we can fix this difference, and if not, why: for i in x: print(p(i)) – neer Jan 10 '20 at 10:00
  • It prints the values: -1119.339450404673 1643.766421983809 4144.202868832156 6396.345534455371 8414.57006316846 10213.252099286421 11806.767287124265 13209.49127099699 14435.799695219604 15500.068204107105 16416.672441974504 17199.988053136796 17864.390681908993 18424.25597260609 18893.9595695431 19287.87711703502 19620.384259396855 19905.856640943606 20158.66990599029 20393.19969885189 20623.821663843428 20158.66990599029 – neer Jan 10 '20 at 10:02
  • @neer Yes, of course they are far away! Your data points have values higher than 15000. A difference requirement of 150 (as you write) is less than 1% deviation!! That is harsh, and probably way too strict... obviously it heavily depends on what you are actually measuring? – Wololo Jan 10 '20 at 10:09
  • I would suggest having some sort of relative difference requirement, and not an absolute one. I've rescaled the data in the plot to get a relative view. This is cleaner, and dimensionless (see the sketch after these comments). – Wololo Jan 10 '20 at 10:10
  • @neer I've updated the answer to include the fitting error/residual. I suggest taking a look into its (uncommonly good) [wiki article](https://en.wikipedia.org/wiki/Errors_and_residuals). If my answer helped you solve your issue, I'd appreciate if you *accepted* it, and if not, I might suggest that you rephrase your question, maybe on [Math Stack Exchange](https://math.stackexchange.com/). – Wololo Jan 10 '20 at 11:34
  • Upvoted for high quality answer. @magnus please see my answer which, though analytically different, agrees with your answer here. – James Phillips Jan 10 '20 at 14:38
  • @neer It is impossible to comment on the differences between your two data sets without detailed insight into how the data is produced. In terms of scientific polynomial fitting, nothing here is wrong. The only difference is that the original data set has lower precision, or higher residuals/error, than the one you post here in the comments. This should be expected and is perfectly normal for empirical data. Something might have influenced the experiment: maybe the external temperature changed abruptly, or the charger connection was temporarily lost during charging. Only you can say. – Wololo Jan 13 '20 at 11:47
  • ... If you don't trust the data, you can discard them and reproduce them. A tip could be to do multiple measurements and average them. Such an approach can greatly reduce noise in empirical data. – Wololo Jan 13 '20 at 11:48
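A rough sketch of the relative-versus-absolute comparison suggested in the comments, assuming the x and y from the question and the same degree-3 fit: the peak absolute error is far above 150, but measured against max(y) it is on the order of 8%.

import numpy as np

x = [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
y = [0,885,3517,5935,8137,11897,10125,13455,14797,15925,16837,17535,18017,18285,18328,18914,19432,19879,20249,20539,20746]

z = np.polyfit(x, y, 3)
abs_err = np.abs(np.polyval(z, x) - np.array(y))
print("max absolute error:", abs_err.max())           # roughly 1680
print("max error / max(y):", abs_err.max() / max(y))  # roughly 0.08, i.e. ~8%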

After reviewing the answer of @Magnus, I plotted the data with reduced plot limits together with a 3rd-order polynomial fit. As you can see, the points within my crudely drawn red circle cannot both lie on a smooth line together with the nearby data. While I could fit smooth curves such as a Hill sigmoidal equation through the data, the variance (noise) in the data itself appears to be the limiting factor in achieving a peak absolute error of 150 with this data set.

[Plot: data with a 3rd-order polynomial fit; the two inconsistent points are circled in red]
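A sketch of the Hill sigmoidal idea mentioned above, using scipy.optimize.curve_fit with a standard three-parameter Hill form and rough initial guesses (the exact form and starting values are assumptions, not taken from the answer); even this smooth, monotone curve cannot bring the peak absolute error anywhere near 150 because of the two circled points.

import numpy as np
from scipy.optimize import curve_fit

x = np.array([0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100], dtype=float)
y = np.array([0,885,3517,5935,8137,11897,10125,13455,14797,15925,16837,17535,18017,18285,18328,18914,19432,19879,20249,20539,20746], dtype=float)

def hill(x, a, b, c):
    # Hill sigmoidal: rises from 0 and saturates toward a
    return a * x**b / (c**b + x**b)

p0 = [max(y), 2.0, 30.0]  # rough initial guesses (assumptions)
popt, pcov = curve_fit(hill, x, y, p0=p0, bounds=([0, 0.1, 1], [1e6, 10, 1000]))

errors = hill(x, *popt) - y
print("peak absolute error:", np.abs(errors).max())  # still well above the 150 target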

James Phillips
  • Yes, indeed. There is no way you would be able to fit this empirical data with a third-degree polynomial *within the error limit* that OP requests. What @neer really should be doing is investigating whether or not the error (deviation of each point from the polyfit) follows a trend, as sketched after these comments: if it does, then there's a high probability that a third-degree polynomial is **not** the best representation. Otherwise, he should investigate whether (e.g.) [Peirce's criterion](https://en.wikipedia.org/wiki/Peirce%27s_criterion) is able to remove one or both of these two "trouble points" as outliers. – Wololo Jan 12 '20 at 10:38
  • I am not able to post the data in tabular form, but this is the sequence of differences in value per point: -1119.33945 758.76649 627.20338 461.34722 277.57401 -1683.74025 1681.78044 -245.48792 -361.16933 -424.88779 -420.2673 -334.93186 -152.50547 139.38787 566.12416 374.0794 188.62959 27.15073 -89.98118 -145.39014 -121.70015 So there is no pattern in the differences. Also, I don't see that the issue exists for only two points. – neer Jan 12 '20 at 12:28
  • After reducing the degree, I also see that the difference is still not steady: -299.927724 1086.531101 592.701276 179.582801 -150.824324 -2172.520099 1204.495476 -653.777599 -657.339324 -580.189699 -420.328724 -179.756399 143.527276 547.522301 1043.228676 862.646401 616.775476 308.615901 -55.832324 -473.569199 -941.594724 – neer Jan 12 '20 at 12:29
  • @neer If you look at the errors at the curve endpoints, far from the two example points I have highlighted, you will see values outside of the "150 limit". Even if you remove the two circled points from the regression, that limit cannot be reached with a smooth curve. This means that the noise in the data set will not allow that limit on this data. – James Phillips Jan 12 '20 at 12:38
  • @neer .. and perhaps more importantly, **why** is it not good enough? Can you shed some light on your error requirements? – Wololo Jan 12 '20 at 13:29
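A sketch of the residual-trend check suggested in the comments above, assuming the x, y, and degree-3 fit from the question; if the plotted errors show systematic structure rather than random scatter around zero, a cubic is probably not the right model.

import matplotlib.pyplot as plt
import numpy as np

x = [0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100]
y = [0,885,3517,5935,8137,11897,10125,13455,14797,15925,16837,17535,18017,18285,18328,18914,19432,19879,20249,20539,20746]

z = np.polyfit(x, y, 3)
residuals = np.array(y) - np.polyval(z, x)

plt.axhline(0, color='k', linewidth=0.5)
plt.scatter(x, residuals)   # look for patterns; random scatter around zero supports the cubic
plt.xlabel("x")
plt.ylabel("y - fit(x)")
plt.show()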