
This is a follow-up to this question. I have several data sets of sample points sharing the same x-coordinates and would now like to do a polynomial fit taking all these sample points into account. That means that I want to end up with one set of parameters that describes the data best.

I figured out how to pass several data sets (in my example below there are only 2) to the fitting function; however, I then obtain one parameter set per data set.

How do I obtain only one set of parameters that describes all my data sets best?

Here is my code and the output I am getting:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline


x = np.array([0., 4., 9., 12., 16., 20., 24., 27.])
y = np.array([[2.9, 4.3, 66.7, 91.4, 109.2, 114.8, 135.5, 134.2],
              [0.9, 17.3, 69.7, 81.4, 119.2, 124.8, 155.5, 144.2]])
y = y.T
# plt.plot(x,y[:, 0], 'ro', x,y[:,1],'bo')
# plt.show()

x_plot = np.linspace(0, max(x), 100)
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

plt.scatter(x, y[:, 0], label="training points 1", c='r')
plt.scatter(x, y[:, 1], label="training points 2", c='b')

for degree in np.arange(4, 5, 1):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=3, fit_intercept=False))
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, label="degree %d" % degree)

plt.legend(loc='lower left')

plt.show()

ridge = model.named_steps['ridge']
print(ridge.coef_)

As you can see, I get one curve per data set:

[plot: one fitted degree-4 curve per data set]

as well as two parameter sets:

[[ -4.09943033e-01  -1.86960613e+00   1.73923722e+00  -1.01704665e-01
    1.73567123e-03]
 [  4.19862603e-01   2.18343362e+00   8.37222298e-01  -4.18711046e-02
    5.69089912e-04]]
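The two rows arise because scikit-learn's `Ridge` supports multi-output regression: fitting against a 2-column `y` trains one independent model per target column, so `coef_` gets one row per column. A minimal sketch with toy data (not the arrays above) illustrating the shape:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two target columns: y1 = 2x + 1 and y2 = -3x + 4.
X = np.arange(8, dtype=float)[:, None]
Y = np.column_stack([2 * X.ravel() + 1, -3 * X.ravel() + 4])

# A 2-column target makes Ridge fit one model per column.
model = Ridge(alpha=1.0).fit(X, Y)
print(model.coef_.shape)  # (2, 1): one coefficient row per target column
```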

PS: If the tool I am using is not the best suited one, I am also happy to get recommendations on what I should use instead.


1 Answer


You'll need to combine your data into a single dataset. For example:

x_all = np.ravel(x + np.zeros_like(y))
y_all = np.ravel(y)
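To see what this does: adding `x` to `np.zeros_like(y)` broadcasts `x` across every row of `y`, so after `ravel` each flattened `y` value still lines up with its x-coordinate. A small sketch with toy arrays (not the data from the question):

```python
import numpy as np

x = np.array([0., 4., 9.])
y = np.array([[1., 2., 3.],
              [4., 5., 6.]])

# Broadcasting repeats x once per row of y; ravel flattens row by row,
# so x_all[i] is the x-coordinate belonging to y_all[i].
x_all = np.ravel(x + np.zeros_like(y))
y_all = np.ravel(y)
print(x_all)  # [0. 4. 9. 0. 4. 9.]
print(y_all)  # [1. 2. 3. 4. 5. 6.]
```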

Here's a full example:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = np.array([0., 4., 9., 12., 16., 20., 24., 27.])
y = np.array([[2.9, 4.3, 66.7, 91.4, 109.2, 114.8, 135.5, 134.2],
              [0.9, 17.3, 69.7, 81.4, 119.2, 124.8, 155.5, 144.2]])

x_all = np.ravel(x + np.zeros_like(y))
y_all = np.ravel(y)

plt.scatter(x, y[0], label="training points 1", c='r')
plt.scatter(x, y[1], label="training points 2", c='b')

x_plot = np.linspace(0, max(x), 100)

for degree in np.arange(4, 5, 1):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=3, fit_intercept=False))
    model.fit(x_all[:, None], y_all)
    y_plot = model.predict(x_plot[:, None])
    plt.plot(x_plot, y_plot, label="degree %d" % degree)

    ridge = model.named_steps['ridge']
    print(degree, ridge.coef_)

plt.legend(loc='best')

plt.show()

[plot: a single degree-4 curve fitted through both data sets]

Output is

4 [  1.72754641e-03   1.36364501e-01   1.29300064e+00  -7.20932655e-02 1.15823050e-03]
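If the broadcasting trick looks opaque, an equivalent way to build the combined arrays is `np.tile`, which repeats `x` once per data set. This is just an alternative spelling of the same flattening, not part of the original answer:

```python
import numpy as np

x = np.array([0., 4., 9., 12., 16., 20., 24., 27.])
y = np.array([[2.9, 4.3, 66.7, 91.4, 109.2, 114.8, 135.5, 134.2],
              [0.9, 17.3, 69.7, 81.4, 119.2, 124.8, 155.5, 144.2]])

# Repeat x once per row of y; flatten y row by row to match.
x_all = np.tile(x, y.shape[0])
y_all = np.ravel(y)

# Identical to the broadcasting version used above.
assert np.array_equal(x_all, np.ravel(x + np.zeros_like(y)))
```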