I want to fit a regression spline with a B-spline basis. In my data, the number of observations is smaller than the number of basis functions, and I still get a good result, but I'm not sure whether this is valid.
Do I need more rows than columns, as in linear regression?
Thank you.

Bahareh

1 Answer

When the number of observations, N, is small, it’s easy to fit a model with basis functions that has low squared error. If you have at least as many basis functions as observations, you can get zero residuals (a perfect fit to the data). That fit is not to be trusted, because it may not be representative of new data points. So yes, you generally want more observations than columns. Mathematically, a design matrix with N rows has rank at most N, so with more than N columns the columns are linearly dependent and ordinary least squares cannot estimate the coefficients uniquely. As a rule of thumb, 15-20 observations are usually needed for each additional variable or spline term.
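To make the rank argument concrete, here is a minimal sketch in R. It assumes simulated data and uses `bs()` from the `splines` package that ships with R; the sample size, the number of basis functions, and the noise level are arbitrary choices for illustration, not values from the question.

```r
library(splines)  # ships with base R

set.seed(1)
n <- 8                                   # deliberately few observations
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Cubic B-spline basis with 12 columns: more basis functions than rows.
B <- bs(x, df = 12, intercept = TRUE)
dim(B)                                   # rows = observations, columns = basis functions

fit <- lm(y ~ B - 1)                     # the basis already contains the intercept

qr(B)$rank                               # cannot exceed n, so it is below ncol(B)
sum(is.na(coef(fit)))                    # coefficients lm() could not estimate
max(abs(resid(fit)))                     # should be near zero when the rank equals n
```

`lm()` does not fail here. It reports `NA` for the coefficients it cannot identify, which is an easy warning sign to miss if you only look at the fitted curve.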

But this isn't always possible, for example in genetics, where we have hundreds of thousands of potential variables and a small sample size. In that case, we turn to tools that help with small samples, such as cross-validation and the bootstrap.

Bootstrap (i.e., resample with replacement) your data points and refit the splines many times (100 will probably do). Then average the fitted splines and use the average as the final spline function. Or you could do cross-validation, where you train on part of the data (say 70%) and then test on the remainder.
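The bootstrap step could look roughly like the sketch below. It assumes simulated data, `bs()` from the `splines` package, and knots held fixed across resamples so that every refit uses the same basis; with a fixed basis, averaging the fitted curves on a grid is the same as averaging the coefficients. All variable names are made up for illustration.

```r
library(splines)

set.seed(42)
n <- 40
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Evaluation grid and a fixed knot layout shared by all bootstrap refits.
grid <- data.frame(x = seq(min(x), max(x), length.out = 50))
inner_knots <- quantile(x, c(0.25, 0.5, 0.75))
bk <- range(x)

n_boot <- 100
preds <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)       # resample rows with replacement
  fit <- lm(y ~ bs(x, knots = inner_knots, Boundary.knots = bk),
            data = dat[idx, ])
  predict(fit, newdata = grid)           # fitted curve on the common grid
})

boot_mean <- rowMeans(preds)             # averaged spline, one value per grid point
boot_band <- apply(preds, 1, quantile, probs = c(0.025, 0.975))  # pointwise spread
```

The pointwise quantiles are an optional extra: they give a rough picture of how much the fitted curve moves between resamples.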

In the functional data analysis framework, there are packages in R that construct and fit spline bases (such as cubic splines and B-splines). These packages include refund, fda, and fda.usc.

For example,

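library(mgcv)  # smooth.construct.cc.smooth.spec() is defined in mgcv
B <- smooth.construct.cc.smooth.spec(object = list(term = "day.t", bs.dim = 12, fixed = FALSE, dim = 1, p.order = NA, by = NA), data = list(day.t = 200:320), knots = list())

constructs a cyclic cubic regression spline basis (the "cc" in the function name) with bs.dim = 12 over time (day.t); for a B-spline basis, mgcv has the analogous "bs" and "ps" constructors. You can also use these packages to help choose a basis dimension.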
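As one possible way to choose a basis dimension by cross-validation, here is a sketch that uses 5-fold cross-validation rather than a single 70/30 split. It relies on `bs()` from the `splines` package and `lm()`, not on the selection tools in refund, fda, or mgcv, and the simulated data and candidate range are assumptions for illustration.

```r
library(splines)

set.seed(7)
n <- 60
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

bk <- range(x)                            # common boundary knots for every fold
dfs <- 3:10                               # candidate basis dimensions (arbitrary range)
folds <- sample(rep(1:5, length.out = n)) # random fold label for each row

cv_mse <- sapply(dfs, function(d) {
  mean(sapply(1:5, function(k) {
    train <- dat[folds != k, ]
    test  <- dat[folds == k, ]
    fit <- lm(y ~ bs(x, df = d, Boundary.knots = bk), data = train)
    mean((test$y - predict(fit, newdata = test))^2)  # held-out squared error
  }))
})

best_df <- dfs[which.min(cv_mse)]         # dimension with the lowest CV error
```

With a sample as small as the one in the question, the cross-validation error itself is noisy, so it is worth looking at the whole `cv_mse` curve rather than only at the minimum.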