
I am fitting data with weights using scipy.odr but I don't know how to obtain a measure of goodness-of-fit or an R squared. Does anyone have suggestions for how to obtain this measure using the output stored by the function?

Ashley

3 Answers


The res_var attribute of the Output is the so-called reduced Chi-square value for the fit, a popular choice of goodness-of-fit statistic. It is somewhat problematic for non-linear fitting, though. You can look at the residuals directly (out.delta for the X residuals and out.eps for the Y residuals). Implementing a cross-validation or bootstrap method for determining goodness-of-fit, as suggested in the linked paper, is left as an exercise for the reader.
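
For example, these attributes can be read straight off the Output object after a fit (a minimal sketch with made-up data and a simple linear model, just to show where they live):

import numpy as np
from scipy import odr

np.random.seed(0)
x = np.linspace(0, 10, 50)
y = np.random.normal(3*x + 2, 0.5, 50)  # toy data: y = 3x + 2 plus noise

model = odr.Model(lambda B, x: B[0]*x + B[1])
data = odr.RealData(x, y, sx=0.1, sy=0.5)
out = odr.ODR(data, model, beta0=[1., 0.]).run()

print(out.res_var)  # reduced chi-square of the fit
print(out.delta)    # X residuals
print(out.eps)      # Y residuals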

Robert Kern
    How can I use the res_var value for some kind of test? I.e. if the data are very noisy, res_var will be high and the test should fail. But res_var is in absolute terms. How can one turn this into, say, a p-value to test for significance? – Oliver Angelil Jun 04 '17 at 03:49

The output of ODR gives both the estimated parameters beta as well as the standard deviation of those parameters sd_beta. Following p. 76 of the ODRPACK documentation, you can convert these values into a t-statistic with (beta - beta_0) / sd_beta, where beta_0 is the number that you're testing significance with respect to (often zero). From there, you can use the t-distribution to get the p-value.

Here's a working example:

import numpy as np
from scipy import stats, odr


def linear_func(B, x):
    """
    From https://docs.scipy.org/doc/scipy/reference/odr.html
    Linear function y = m*x + b
    """
    # B is a vector of the parameters.
    # x is an array of the current x values.
    # x is in the same format as the x passed to Data or RealData.
    #
    # Return an array in the same format as y passed to Data or RealData.
    return B[0] * x + B[1]


np.random.seed(0)
sigma_x = .1
sigma_y = .15
N = 100
x_star = np.linspace(0, 10, N)
x = np.random.normal(x_star, sigma_x, N)
# the true underlying function is y = 2*x_star + 1
y = np.random.normal(2*x_star + 1, sigma_y, N)

linear = odr.Model(linear_func)
dat = odr.Data(x, y, wd=1./sigma_x**2, we=1./sigma_y**2)
this_odr = odr.ODR(dat, linear, beta0=[1., 0.])
odr_out = this_odr.run()
# degrees of freedom are n_samples - n_parameters
df = N - 2  # equivalently, df = odr_out.iwork[10]
beta_0 = 0  # test if slope is significantly different from zero
t_stat = (odr_out.beta[0] - beta_0) / odr_out.sd_beta[0]  # t statistic for the slope parameter
p_val = stats.t.sf(np.abs(t_stat), df) * 2
print('Recovered equation: y={:3.2f}x + {:3.2f}, t={:3.2f}, p={:.2e}'.format(odr_out.beta[0], odr_out.beta[1], t_stat, p_val))

Recovered equation: y=2.00x + 1.01, t=239.63, p=1.76e-137

One note of caution in using this approach on nonlinear problems, from the same ODRPACK docs:

"Note that for nonlinear ordinary least squares, the linearized confidence regions and intervals are asymptotically correct as n → ∞ [Jennrich, 1969]. For the orthogonal distance regression problem, they have been shown to be asymptotically correct as σ∗ → 0 [Fuller, 1987]. The difference between the conditions of asymptotic correctness can be explained by the fact that, as the number of observations increases in the orthogonal distance regression problem one does not obtain additional information for ∆. Note also that Vˆ is dependent upon the weight matrix Ω, which must be assumed to be correct, and cannot be confirmed from the orthogonal distance regression results. Errors in the values of wǫi and wδi that form Ω will have an adverse affect on the accuracy of Vˆ and its component parts. The results of a Monte Carlo experiment examining the accuracy of the linearized confidence intervals for four different measurement error models is presented in [Boggs and Rogers, 1990b]. Those results indicate that the confidence regions and intervals for ∆ are not as accurate as those for β.

Despite its potential inaccuracy, the covariance matrix is frequently used to construct confidence regions and intervals for both nonlinear ordinary least squares and measurement error models because the resulting regions and intervals are inexpensive to compute, often adequate, and familiar to practitioners. Caution must be exercised when using such regions and intervals, however, since the validity of the approximation will depend on the nonlinearity of the model, the variance and distribution of the errors, and the data itself. When more reliable intervals and regions are required, other more accurate methods should be used. (See, e.g., [Bates and Watts, 1988], [Donaldson and Schnabel, 1987], and [Efron, 1985].)"

alowet
  • I see that you used a linear function in this example. Would this be viable for non-linear functions as well? – mknote Jul 04 '21 at 02:01
  • Good question, from the docs it appears that this method offers a reasonable approximation, but can fail in certain non-linear cases. I updated the OP with information to this effect. – alowet Jul 05 '21 at 18:12

As mentioned by Robert Kern, chi-square or variance of the residuals is one of the more commonly used tests of goodness of fit. ODR stores the sum of squared residuals in out.sum_square, and you can verify yourself that out.res_var = out.sum_square/degrees_freedom corresponds to what is commonly called reduced chi-square: i.e. the chi-square test result divided by its expected value.
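
As a quick sanity check (a small self-contained sketch; the data and model here are only illustrative):

import numpy as np
from scipy import odr

np.random.seed(0)
x = np.linspace(0, 10, 100)
y = np.random.normal(2*x + 1, 0.15, 100)

linear = odr.Model(lambda B, x: B[0]*x + B[1])
out = odr.ODR(odr.RealData(x, y, sx=0.1, sy=0.15), linear, beta0=[1., 0.]).run()

degrees_freedom = len(x) - 2  # observations minus number of fitted parameters
print(np.isclose(out.sum_square / degrees_freedom, out.res_var))  # True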

As for the other very popular estimator of goodness of fit in linear regression, R squared and its adjusted version, we can define the functions

import numpy as np

def R_squared(observed, predicted, uncertainty=1):
    """ Returns R square measure of goodness of fit for predicted model. """
    weight = 1./uncertainty
    return 1. - (np.var((observed - predicted)*weight) / np.var(observed*weight))

def adjusted_R(x, y, model, popt, unc=1):
    """
    Returns adjusted R squared test for optimal parameters popt calculated
    according to W-MN formula, other forms have different coefficients:
    Wherry/McNemar : (n - 1)/(n - p - 1)
    Wherry : (n - 1)/(n - p)
    Lord : (n + p - 1)/(n - p - 1)
    Stein : (n - 1)/(n - p - 1) * (n - 2)/(n - p - 2) * (n + 1)/n

    """
    # Assuming you have a model with ODR argument order f(beta, x)
    # otherwise if model is of the form f(x, a, b, c..) you could use
    # R = R_squared(y, model(x, *popt), uncertainty=unc)
    R = R_squared(y, model(popt, x), uncertainty=unc)
    n, p = len(y), len(popt)
    coefficient = (n - 1)/(n - p - 1)
    adj = 1 - (1 - R) * coefficient
    return adj, R

From the output of your ODR run you can find the optimal values for your model's parameters in out.beta and at this point we have everything we need for computing R squared.

from scipy import odr

def lin_model(beta, x):
    """
    Linear function y = m*x + q
    slope m, constant term/y-intercept q
    """
    return beta[0] * x + beta[1]

# x, y and their uncertainties sigma_x, sigma_y are your measured data
linear = odr.Model(lin_model)
data = odr.RealData(x, y, sx=sigma_x, sy=sigma_y)
init = odr.ODR(data, linear, beta0=[1, 1])
out = init.run()

adjusted_Rsq, Rsq = adjusted_R(x, y, lin_model, popt=out.beta)
btomelleri
    All of these options, however, are useful only for linear regression. As someone performing non-linear regression, this is insufficient. What are the options in the non-linear case? – mknote Jul 04 '21 at 02:00
  • With this definition of R_squared, isn't this calculating the error (residuals) in the sense of OLS, which is what we don't want (whereas TLS obtains the error from the minimum distance to the fit line)? Shouldn't the r2 returned from ODR be insensitive to the order of input of x and y (x=x and y=y should give the same output as x=y and y=x)? – BML Jul 14 '21 at 05:27