I'm building a linear regression model where one of the input variables is the number of sales per day. Rather than using it as a linear input, I want to use some form of cubic spline transformation, because the relationship isn't linear up to a certain point and tends to tail off after it.

I believe I can create cubic splines for this variable on my training dataset like so, and then build a linear model on the resulting columns:

from patsy import dmatrix
transformed_x = dmatrix("bs(data, knots=(2000, 3000, 4000), degree=3, include_intercept=False)", {"data": df['Sales_Volume']}, return_type='dataframe')

But for making predictions for a single new data point, say for 5000 sales, how can I use these same splines to make a prediction on my fitted model?

If I just run the same dmatrix call again on the single data point of 5000 sales, I get this error:

ValueError: some knot values ([2000 3000 4000]) fall below lower bound (5000)

It works if I have a large new dataset to predict whose values cover the range of all of those knots, but then I'm not sure I can be confident that making the same transformation on a new dataset will yield correct results.
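For reference, here is a minimal sketch of one way to reuse the training-time splines, assuming patsy's `build_design_matrices` together with the `design_info` attribute that `dmatrix` attaches to its result. Calling `dmatrix` a second time re-derives the spline boundaries from whatever data is passed in, which is consistent with the error above (the lower bound becomes 5000). The training data below is made up for illustration, since the real `df['Sales_Volume']` is not shown.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix, build_design_matrices

# Hypothetical training data standing in for df['Sales_Volume'];
# it spans 0..6000 so that all knots and the new point lie inside its range.
df = pd.DataFrame({"Sales_Volume": np.linspace(0, 6000, 200)})

formula = "bs(data, knots=(2000, 3000, 4000), degree=3, include_intercept=False)"
transformed_x = dmatrix(formula, {"data": df["Sales_Volume"]}, return_type="dataframe")

# Reuse the design_info stored at training time, so the knots and the
# boundary values are the ones memorized from the training data rather
# than being re-estimated from the single new point.
new_x = build_design_matrices(
    [transformed_x.design_info],
    {"data": np.array([5000.0])},
    return_type="dataframe",
)[0]

print(new_x.shape)  # one row, same columns as transformed_x
```

The resulting `new_x` row can then be passed to the fitted model's predict method. As far as I know, patsy refuses to evaluate points that fall outside the outermost knots, so this sketch assumes the new value lies within the range seen in training.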

DB_DS
  • Using a cubic spline for regression sounds like a recipe for over-fitting. – Mark Ransom Jun 18 '20 at 18:25
  • I have the same problem. The only way I found to manage it is to add dummy min and max values to my data (to have all knot values inside the range) and to ignore the results of prediction from dummy values. I'm wondering if there is a better solution. – manu190466 Nov 17 '20 at 14:55
  • Adding dummy min and max values is the same approach that I took in the end. It seems to work, but I also wonder if a better solution exists. In response to the previous comment, the reason for creating a cubic spline was to enable me to use this in a coxPH survival model. – DB_DS Nov 18 '20 at 15:16
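A possible alternative to padding the data with dummy min and max values, sketched here under the assumption that patsy's `bs` accepts explicit `lower_bound` and `upper_bound` arguments: fixing the boundaries in the formula means they no longer depend on the data passed in. The bounds 0 and 10000 and the training values are hypothetical and would need to be chosen to cover every value you expect to see.

```python
import numpy as np
from patsy import dmatrix

# Assumed bounds (0 and 10000) chosen for illustration only; they must
# enclose all knots and every value that will ever be transformed.
formula = (
    "bs(data, knots=(2000, 3000, 4000), degree=3, include_intercept=False, "
    "lower_bound=0, upper_bound=10000)"
)

# Hypothetical training values standing in for df['Sales_Volume'].
train = dmatrix(formula, {"data": np.linspace(500, 6000, 50)}, return_type="dataframe")

# With fixed bounds, the same formula can be applied to a single point
# without the "knot values fall below lower bound" error.
single = dmatrix(formula, {"data": np.array([5000.0])}, return_type="dataframe")

print(single.shape)  # one row, same number of columns as train
```

Because the knots and both boundaries are fixed in the formula, the basis for a given value should not depend on which other values are transformed alongside it.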

0 Answers