
I am trying to calculate the training error on the data I'm using.

I believe I'm calculating the error incorrectly. The formula is the mean squared training error:

    E = (1/N) * sum_{n=1..N} (t(n) - y(x(n)))^2

y(x) is the fitted polynomial of degree M:

    y(x) = sum_{m=0..M} w_m * x^m = w_0 + w_1*x + ... + w_M*x^M

I am calculating this in the function fitPoly(M), on the line that computes errTrain. I believe getY is incorrectly calculating y(x(n)); it should evaluate the polynomial once per training input x(n), but I don't know what else to do.


Below is the Minimal, Complete, and Verifiable example.

import numpy as np
import matplotlib.pyplot as plt

dataTrain = [[2.362761180904257019e-01, -4.108125266714775847e+00],
[4.324296163702689988e-01,  -9.869308732049049127e+00],
[6.023323504115264404e-01,  -6.684279243433971729e+00],
[3.305079685397107614e-01,  -7.897042003779912278e+00],
[9.952423271981121200e-01,  3.710086310489402628e+00],
[8.308127402955634011e-02,  1.828266768673480147e+00],
[1.855495407116576345e-01,  1.039713135916495501e+00],
[7.088332047815845138e-01,  -9.783208407540947560e-01],
[9.475723071629885697e-01,  1.137746192425550085e+01],
[2.343475721257285427e-01,  3.098019704040922750e+00],
[9.338350584099475160e-02,  2.316408265530458976e+00],
[2.107903139601833287e-01,  -1.550451474833406396e+00],
[9.509966727520677843e-01,  9.295029459100994984e+00],
[7.164931165416982273e-01,  1.041025972594300075e+00],
[2.965557300301902011e-03,  -1.060607693351102121e+01]]

def strip(L, xt):
    # Return column xt (0 = x, 1 = t) from a list of [x, t] pairs
    ret = []
    for i in L:
        ret.append(i[xt])
    return ret

x1 = strip(dataTrain, 0)
y1 = strip(dataTrain, 1)

# HELP HERE

def getY(m, w, D):
    # Intended: y(x) = w[0] + w[1]*x + ... + w[M]*x**M for a single x.
    # As written, D is the list of all x values, so w[1:] * D[:m] pairs each
    # coefficient with a *different* data point instead of with powers of one
    # x (and w was reversed in fitPoly, so w[0] is not the intercept here).
    y = w[0]
    y += np.sum(w[1:] * D[:m])
    return y

# HELP ABOVE

def dataMatrix(X, M):
    Z = []
    for x in range(len(X)):
        row = []
        for m in range(M + 1):
            row.append(X[x][0] ** m)
        Z.append(row)
    return Z

def fitPoly(M):
    t = []
    for i in dataTrain:
        t.append(i[1])
    w, _, _, _ = np.linalg.lstsq(dataMatrix(dataTrain, M), t)
    w = w[::-1]
    errTrain = np.sum(np.subtract(t, getY(M, w, x1)) ** 2)/len(x1)
    print('errTrain: %s' % (errTrain))
    return([w, errTrain])

#fitPoly(8)

def plotPoly(w):
    plt.ylim(-15, 15)
    x, y = zip(*dataTrain)
    plt.plot(x, y, 'bo')
    xw = np.arange(0, 1, .001)
    yw = np.polyval(w, xw)
    plt.plot(xw, yw, 'r')

#plotPoly(fitPoly(3)[0])

def bestPoly():
    m = 0
    plt.figure(1)
    plt.xlim(0, 16)
    plt.ylim(0, 250)
    plt.xlabel('M')
    plt.ylabel('Error')
    plt.suptitle('Question 3: training and Test error')
    while m < 16:
        plt.figure(0)
        plt.subplot(4, 4, m + 1)
        plotPoly(fitPoly(m)[0])
        plt.figure(1)
        plt.plot(fitPoly(m)[1])
        #plt.plot(fitPoly(m)[2])
        m+= 1
    plt.figure(3)
    plt.xlabel('t')
    plt.ylabel('x')
    plt.suptitle('Question 3: best-fitting polynomial (degree = 8)')
    plotPoly(fitPoly(8)[0])
    print('Best M: %d\nBest w: %s\nTraining error: %s' % (8, fitPoly(8)[0], fitPoly(8)[1], ))

bestPoly()
Andrew Raleigh
  • Why do you believe that you are calculating the error incorrectly? – Kevin K. Oct 10 '17 at 17:25
  • @KevinK. "You should find that the test error is usually, if not always, greater than the training error. Also, the training error should decrease as M increases, reaching zero when M=15. The test error should tend to decrease initially and then begin to increase, becoming extremely large when M is near 15." The maximum error should also be below 250, according to the handout. – Andrew Raleigh Oct 10 '17 at 17:29
  • Regards @AndrewRaleigh. Is `np.sum(w[1:] * D[:m])` supposed to work? What is the function `getY` for? –  Oct 10 '17 at 17:33
  • I believe `getY` should be written more like `y = w[0] + np.sum([w[i+1]*D**i for i in range(m)])` with `D` being a single value instead of an array. – Kevin K. Oct 10 '17 at 17:37
  • @Arief getY is supposed to calculate y(x(n)), to fill in the formula. – Andrew Raleigh Oct 10 '17 at 17:40
  • @KevinK. Currently D is DataTrain, what single value are you referring to? – Andrew Raleigh Oct 10 '17 at 17:41
  • @AndrewRaleigh First, I think `w[1:] * D[:m]` will make an error. So you wanted `getY` to output all values of y(x) with `D` (training data) as the input domain? I presume that `D` holds the values in the 1st column and t(n) the values in the 2nd column of `dataTrain`? –  Oct 10 '17 at 17:46

2 Answers


Perhaps I can contribute:

def pol_y(x, w):
    # Evaluate y(x) = w[0] + w[1]*x + ... + w[M]*x**M for a single x
    y = 0
    power = 0
    for i in w:
        y += i * (x ** power)
        power += 1
    return y

The M is included implicitly because it is the final index of w. So if w = [0, 0, 1], then pol_y(x, w) is the same as f(x) = x^2.
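
As a quick sanity check of that claim (the values here are just an illustration):

print(pol_y(3, [0, 0, 1]))  # w encodes f(x) = x^2, so this prints 9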

If you want to map pol_y over the 1st column of dataTrain:

get_Y = [pol_y(i, w) for i in x1]

The error may be calculated by

vec_error = [(y1[i] - get_Y[i]) ** 2 for i in range(len(y1))]
train_error = np.sum(vec_error) / len(y1)
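
Putting the pieces together with the code from the question, a minimal sketch of a fixed fitPoly might look like this (my wiring, with fitPoly_fixed as a hypothetical name; note that np.linalg.lstsq with your dataMatrix returns coefficients ordered w[0]..w[M], lowest power first, which is exactly the order pol_y expects, so reverse only when handing w to np.polyval):

def fitPoly_fixed(M):
    t = [row[1] for row in dataTrain]
    w, _, _, _ = np.linalg.lstsq(dataMatrix(dataTrain, M), t, rcond=None)
    y_hat = [pol_y(x, w) for x in x1]  # y(x(n)) for every training point
    errTrain = np.sum([(t[n] - y_hat[n]) ** 2 for n in range(len(t))]) / len(t)
    return w[::-1], errTrain  # reversed so np.polyval gets highest power first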

Hope this helps.

  • I think your `pol_y` is the way `y(x)` is meant to be calculated, thank you! Any thoughts about `y(x(n))` though? Trying your get_Y results in errors orders of magnitude larger than I expected, so I'm not entirely sure that's it. – Andrew Raleigh Oct 10 '17 at 18:50
  • I have added something. So the vector/list `w` will be the coefficients of the best-fit polynomial, is that right? –  Oct 10 '17 at 19:31

Updated: This solution uses numpy's np.interp, which connects the points with piecewise-linear interpolation as a kind of "best fit". We then use your error function to find the difference between this interpolated line and the predicted y values for each polynomial degree.

import numpy as np
import matplotlib.pyplot as plt
import itertools

dataTrain = [
  [2.362761180904257019e-01, -4.108125266714775847e+00],
  [4.324296163702689988e-01,  -9.869308732049049127e+00],
  [6.023323504115264404e-01,  -6.684279243433971729e+00],
  [3.305079685397107614e-01,  -7.897042003779912278e+00],
  [9.952423271981121200e-01,  3.710086310489402628e+00],
  [8.308127402955634011e-02,  1.828266768673480147e+00],
  [1.855495407116576345e-01,  1.039713135916495501e+00],
  [7.088332047815845138e-01,  -9.783208407540947560e-01],
  [9.475723071629885697e-01,  1.137746192425550085e+01],
  [2.343475721257285427e-01,  3.098019704040922750e+00],
  [9.338350584099475160e-02,  2.316408265530458976e+00],
  [2.107903139601833287e-01,  -1.550451474833406396e+00],
  [9.509966727520677843e-01,  9.295029459100994984e+00],
  [7.164931165416982273e-01,  1.041025972594300075e+00],
  [2.965557300301902011e-03,  -1.060607693351102121e+01]
  ]

data = np.array(dataTrain)
data = data[data[:, 0].argsort()]

X,y = data[:, 0], data[:, 1]

fig, ax = plt.subplots(4, 4)
indices = list(itertools.product([0, 1, 2, 3], repeat=2))
for i, loc in enumerate(indices, start=1):
  xx = np.linspace(X.min(), X.max(), 1000)  # dense grid for smooth curves
  yy = np.interp(xx, X, y)                  # piecewise-linear interpolation of the data
  w = np.polyfit(X, y, i)                   # least-squares fit of degree i
  y_pred = np.polyval(w, xx)
  ax[loc].scatter(X, y)
  ax[loc].plot(xx, y_pred)
  ax[loc].plot(xx, yy, 'r--')

  error = np.square(yy - y_pred).sum() / X.shape[0]
  print(error)

plt.show()

This prints out:

2092.19807848
1043.9400277
1166.94550318
252.238810889
225.798905379
155.785478366
125.662973726
143.787869281
6553.66570273
10805.6609259
15577.8686283
13536.1755299
108074.871771
213513916823.0
472673224393.0
1.01198058355e+12

Visually, it plots a 4x4 grid of subplots, one per polynomial degree, each showing the scattered data points, the polynomial fit, and the red dashed interpolation line.

From here, it's just a matter of saving those errors to a list and finding the minimum.
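
For instance, a minimal sketch along those lines (best_degree is my name; it reuses X, y, xx, and yy from the code above):

errors = []
for degree in range(1, 17):
  w = np.polyfit(X, y, degree)
  y_pred = np.polyval(w, xx)
  errors.append(np.square(yy - y_pred).sum() / X.shape[0])
best_degree = 1 + int(np.argmin(errors))  # degrees start at 1
print(best_degree, errors[best_degree - 1])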

Jarad
  • Thanks @Jarad, your edit is right, as the spikes should get more pronounced for M > 8. Nonetheless, your code is great; is there a way to use `np.linalg.lstsq` instead of `polyfit`/`polyval`? – Andrew Raleigh Oct 10 '17 at 18:48
  • I updated my answer. I don't think you can do this with `np.linalg.lstsq` because this is not an (M x N) matrix; it's a single X value. But I could be wrong. – Jarad Oct 10 '17 at 19:18
  • You can use the function dataMatrix to create the (M x N) matrix. I figured out how to do it using linalg. There's just one more thing, if you don't mind: the error isn't supposed to be larger than 250; although the figures seem right, the errors seem quite large? – Andrew Raleigh Oct 10 '17 at 19:30
  • I think the numbers are substantially larger because I artificially made 1000 X datapoints in the `xx` variable to draw smoother lines and to show the waves. Using the plain X and y coordinates isn't enough to visualize the true polynomial line. So, in my opinion, the `error` being output is still relative to the data, just with more datapoints, which adds more error when calculating the difference. One thing to try might be changing this number from 1000 to 100 or 1000000 and see if it gives a different "best" (see the sketch after this thread). – Jarad Oct 10 '17 at 20:16
  • Thank you for your reply. Though I get a lot of warnings because of the too-high-order `polyfit`. – Martian Dec 16 '20 at 17:15
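
Following up on the comment about the inflated numbers: one quick check is to also measure each fit against the original 15 data points rather than the dense grid. A two-line sketch that could go inside the answer's loop (error_data is my name); this should bring the figures much closer to the scale the handout describes:

# Error of the degree-i fit measured on the 15 real points,
# not the 1000-point interpolation grid
error_data = np.square(y - np.polyval(w, X)).sum() / X.shape[0]
print(error_data)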