
I have two sets of data between which I want to find a correlation. Although the data is quite scattered, there is obviously a relation. I currently use numpy polyfit (8th order), but the line "wiggles" (especially at the beginning and the end), which is not appropriate. I also don't think the fit is very good at the beginning of the line (the curve should be slightly steeper).

How can I get a best fit "spline" through these data points?

Data scatter with polyfit

My current code:

import numpy as np

# fit regression line
regressionLineOrder = 8
regressionLine = np.polyfit(data['x'], data['y'], regressionLineOrder)
p = np.poly1d(regressionLine)  # callable polynomial: p(x) evaluates the fit
kevins_1
Yorian
  • This may be more of a [Cross Validated](https://stats.stackexchange.com/) question, but in any case those effects are inherent to a polynomial fit. If you want a better curve you may need a more advanced regression technique; [scikit-learn](http://scikit-learn.org/stable/) provides several algorithms. [Gaussian processes](https://en.wikipedia.org/wiki/Gaussian_process) could be a good choice here, although there may be too much data to use them directly. – jdehesa Apr 25 '17 at 13:14

2 Answers


Take a look at @MatthewDrury's answer to "Why use regularisation in polynomial regression instead of lowering the degree?". It is spot on. The most interesting part comes at the end, where he fits the regression with a natural cubic spline in place of a regularized polynomial of degree 10. You could use scipy.interpolate.CubicSpline to accomplish something similar, and scipy.interpolate contains classes for many other spline methods. Keep in mind that CubicSpline interpolates, i.e. it passes through every point it is given, and it requires strictly increasing x values.

Here is a simple example:

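import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline

# CubicSpline needs strictly increasing x: sort and keep the first sample per x value
x, idx = np.unique(data['x'], return_index=True)
cs = CubicSpline(x, np.asarray(data['y'])[idx])
x_range = np.arange(x_min, x_max, some_step)  # placeholders: your plotting range and step
plt.plot(x_range, cs(x_range), label='Cubic Spline')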
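Because CubicSpline passes through every point, a smoothing (approximating) spline may be closer to what is wanted for noisy scatter. The following is only a sketch using scipy.interpolate.UnivariateSpline. The synthetic x and y arrays are stand-ins for data['x'] and data['y'], and both the smoothing factor s and the noise level of 0.2 are assumptions to tune, not values taken from the question.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic stand-in for data['x'], data['y'] (assumption: noisy, unsorted scatter)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 500)
y = np.sqrt(x) + rng.normal(0.0, 0.2, x.size)

# UnivariateSpline expects x in increasing order
order = np.argsort(x)
x_sorted, y_sorted = x[order], y[order]

# k=3 gives a cubic spline; s is the smoothing factor (larger s = smoother curve).
# s ~ n * sigma**2 is a common starting point; sigma = 0.2 is assumed here.
smooth = UnivariateSpline(x_sorted, y_sorted, k=3, s=x.size * 0.2**2)

x_range = np.linspace(x_sorted.min(), x_sorted.max(), 200)
y_fit = smooth(x_range)
```

Lowering s pulls the curve toward the individual points, and s=0 reduces it to interpolation, so it is worth plotting a few values of s against the scatter before settling on one.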
Grr
  • `CubicSpline` does interpolation. The article talks about approximation. Do you know how to perform spline approximation in Python? – Antony Hatchkins Jul 06 '22 at 03:11

There are a few possible issues with your data set. In your plot the n (x, y) points are joined with straight lines; if you plot them as points instead, you will see how the point density varies along your domain, and that it is not evenly distributed. Say your domain is [xmin, xmax]: an 8th-order polynomial can interpolate inside it, but it wiggles because of the high order and because the point density is uneven. Polynomials are also poor at extrapolation, since there are no control points outside your domain.

You could fix that with a spline. A natural cubic spline constrains the curve at xmin and xmax (its second derivative is zero there), but to use one you should sort your data set by x and take a subsample of the n points, smoothed with a rolling average, as control points for the spline algorithm.

If your problem has an analytical solution (your point distribution looks like a Gaussian variogram, for instance), try optimizing the model parameters (range and sill for the Gaussian variogram) to minimize the error inside the domain and follow the asymptotes outside it.
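The sort, average, and spline steps described above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: the helper block_average is hypothetical, it uses non-overlapping block means (a rolling mean sampled once per window) as the control points, and the synthetic x and y arrays and the choice of 15 blocks are assumptions standing in for the real data.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def block_average(x, y, n_blocks=15):
    """Hypothetical helper: sort by x, then average equal-count blocks."""
    order = np.argsort(x)
    xs = np.asarray(x, dtype=float)[order]
    ys = np.asarray(y, dtype=float)[order]
    # Equal-count blocks put more control points where the data is dense
    cx = np.array([c.mean() for c in np.array_split(xs, n_blocks)])
    cy = np.array([c.mean() for c in np.array_split(ys, n_blocks)])
    return cx, cy


# Synthetic stand-in for the scatter (assumption: uneven point density)
rng = np.random.default_rng(1)
x = 10.0 * rng.beta(2.0, 4.0, 600)
y = np.sqrt(x) + rng.normal(0.0, 0.2, x.size)

cx, cy = block_average(x, y)

# bc_type='natural' sets the second derivative to zero at both end points
spline = CubicSpline(cx, cy, bc_type='natural')

x_range = np.linspace(cx[0], cx[-1], 200)
y_fit = spline(x_range)
```

The control points span a slightly narrower range than [xmin, xmax] because each one is a block mean, so evaluating the spline outside [cx[0], cx[-1]] is extrapolation. Fewer blocks give a smoother curve at the cost of detail.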

ePuntel