I am using Python 3.6 for data fitting. Recently I came across the following problem, and since I lack experience here, I am not sure how to deal with it.
If I use numpy.polyfit(x, y, 1, cov=True) and scipy.optimize.curve_fit(lambda x, a, b: a*x + b, x, y) on the same set of data points, I get nearly the same coefficients a and b. But the entries of the covariance matrix returned by scipy.optimize.curve_fit are roughly half of those from numpy.polyfit.
Since I want to use the diagonal of the covariance matrix to estimate the uncertainties of the coefficients (u = numpy.sqrt(numpy.diag(cov))), I have three questions:
- Which covariance matrix is the right one (which one should I use)?
- Why is there a difference?
- What would it take to make them equal?
Thanks!
Edit: Here is a minimal example:
import numpy as np
import scipy.optimize as sc

data = np.array([[1, 2, 3, 4, 5, 6, 7],
                 [1.1, 1.9, 3.2, 4.3, 4.8, 6.0, 7.3]]).T
x = data[:, 0]
y = data[:, 1]

# Linear fit with numpy.polyfit; cov=True also returns the covariance matrix.
coeffs, cov_polyfit = np.polyfit(x, y, 1, cov=True)
print('Polyfit:', np.diag(cov_polyfit))

# The same linear model fitted with scipy.optimize.curve_fit.
popt, cov_curvefit = sc.curve_fit(lambda x, a, b: a*x + b, x, y)
print('Curve_Fit:', np.diag(cov_curvefit))
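The uncertainty estimate mentioned above would then be computed like this (using the variable names from the snippet):

u_polyfit = np.sqrt(np.diag(cov_polyfit))      # uncertainties of [a, b] from polyfit
u_curvefit = np.sqrt(np.diag(cov_curvefit))    # uncertainties of [a, b] from curve_fit
print('Uncertainties:', u_polyfit, u_curvefit)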
If I use statsmodels.api, the result corresponds to that of curve_fit.
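For reference, the statsmodels comparison I mean looks roughly like this (a minimal sketch using statsmodels' standard OLS API, with x and y from the snippet above; note that statsmodels orders the parameters as [intercept, slope]):

import statsmodels.api as sm

X = sm.add_constant(x)            # design matrix with an added intercept column
ols = sm.OLS(y, X).fit()          # ordinary least squares fit of y = b + a*x
print('OLS:', np.diag(ols.cov_params()))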