0

I am trying to make natural cubic spline using patsy library. Here is my code:

import numpy as np
from sklearn.linear_model import LinearRegression
from patsy import cr
import matplotlib.pyplot as plt

x = df.age #some data
y = df.wage

x_basis = cr(x, df=15)
model = LinearRegression().fit(x_basis, y)
y_hat = model.predict(x_basis)
plt.scatter(x, y)
plt.plot(x, y_hat, 'r')
plt.show()

The output is the following: enter image description here

I believe that there should be one line. How can I solve the issue?

Bruh
  • 25
  • 7

2 Answers2

3

This is just a plotting issue. The plt.plot function draws by default a line plot without markers. Since your data is not sorted by x variable, the line jumps back and forth making the result look messy. I generated sample data and plotted with plt.plot defaults and then by hiding the line, and adding markers. The results are

With default plt.plot arguments

Rest of the code below (not repeated for convenience)

spline_basis = patsy.cr(x, df=3)
model = LinearRegression().fit(spline_basis, y)
y_spline = model.predict(spline_basis)
plt.scatter(x, y)
plt.plot(x, y_spline, color="red")
plt.show()

enter image description here

With plt.plot when marker='.' and ls=''

Rest of the code below (not repeated for convenience)

spline_basis = patsy.cr(x, df=3)
model = LinearRegression().fit(spline_basis, y)
y_spline = model.predict(spline_basis)
plt.scatter(x, y)
plt.plot(x, y_spline, ls="", marker=".", color="red") # Only this changed
plt.show()

enter image description here

By reordering data

Rest of the code below (not repeated for convenience)
If you want to draw a line plot, you could just reorder the data for fitting, like this:

spline_basis = patsy.cr(x, df=3)
model = LinearRegression().fit(spline_basis, y)
y_spline = model.predict(spline_basis)
plt.scatter(x, y)
xsorted, ysorted = zip(*[(X, Y) for (X, Y) in sorted(zip(x, y_spline))]) # simple reordering
plt.plot(xsorted, ysorted, color="red")
plt.show()

enter image description here

By using new data for prediction

Usually, models are created for prediction. The idea is to create model with training data and then use the model with some new data. This new data can be anything. If it is sorted, it can be plotted as lineplot. In this case, you can create new x-values, for example

new_x = np.linspace(10, 100, 100)

and calculate the predicted y-values for them. For this, you need to know (and save) only few values. Actually, just df, lower_bound, upper_bound and the 4 floats from model._coef.

# Fit model
spline_basis = patsy.cr(x, df=3, lower_bound=x.min(), upper_bound=x.max())
model = LinearRegression().fit(spline_basis, y)
y_train = model.predict(spline_basis)

# Use model
new_x = np.linspace(10, 100, 100)  # 100 points
spline_basis_new = patsy.cr(new_x, df=3, lower_bound=x.min(), upper_bound=x.max())
new_y = model.predict(spline_basis_new)

plt.scatter(x, y)
plt.plot(x, y_train, color="red", ls="", marker=".")
plt.plot(new_x, new_y, color="green")
plt.show()

enter image description here

Rest of the code

from matplotlib import pyplot as plt
import numpy as np
import patsy
from sklearn.linear_model import LinearRegression


def dummy_data():
    np.random.seed(1)

    x = np.random.choice(np.arange(18, 81), size=4000)

    def model(x):
        a = 83 / 107520
        b = -895 / 5376
        c = 17747 / 1680
        d = -622 / 7
        return a * x ** 3 + b * x ** 2 + c * x + d

    def noisemodel(x):
        an = -0.0591836734693878
        bn = 5.25510204081633
        cn = -31.6326530612245
        return an * x ** 2 + bn * x + cn

    y = model(x)
    ynoise = np.array([np.random.randn() * noisemodel(_) for _ in x])

    return x, y + ynoise


x, y = dummy_data()
Niko Föhr
  • 28,336
  • 10
  • 93
  • 96
  • Regarding the new data points part, if I have only one x in production for example, how should I apply cr and then use the saved model? – mlee_jordan Feb 09 '21 at 10:36
  • 1
    You would need to save the `df`, `lower_bound` and `upper_bound` (three numbers) which you used with the training data to create new spline basis with [patsy.cr](https://patsy.readthedocs.io/en/latest/API-reference.html#patsy.cr). Saving the regression model created with sklearn [possible with pickle](https://scikit-learn.org/stable/modules/model_persistence.html). Although, if you have just a linear model, all you need to do is to save the `model._coef` and after that predictions are just matrix multiplication. – Niko Föhr Feb 09 '21 at 11:25
  • Many thanks @np8! But here are we still assuming x has multiple values so that we can indeed apply it like cr(x_new, df=10, constraints="center", lower_bound=x.min(), upper_bound=x.max())? Suppose my x_new is 150 and I'd like to get corresponding smoothed values [which has 1 x 10 dimension] so that I could use this in my linear regression. What am I missing? – mlee_jordan Feb 09 '21 at 18:08
  • I'm not on computer right now but would it work if you just created an np.array with just one value? If not, I guess it would be best to create a new question. I am not sure what kind of prerequisites the patsy.cr has for its inputs. – Niko Föhr Feb 09 '21 at 19:42
  • you can find my question at: https://stackoverflow.com/questions/66107014/how-to-use-the-smoothed-line-with-patsy-cr-in-production. Thanks! – mlee_jordan Feb 09 '21 at 20:01
-1

Bruh, sort your values before passing to a patsy function.

DSBA Piazza Teacher Assistant Team

Ilya Zisman
  • 31
  • 1
  • 6