I've read a related post on manually calculating R-squared values after using scipy.optimize.curve_fit(). In that post they calculate an R-squared value for a function that follows a power law (f(x) = a*x^b). I'm trying to do the same but get negative R-squared values.

Here is my code:

import numpy as np
from scipy.optimize import curve_fit

def powerlaw(x, a, b):
    '''Generic power law function.'''
    return a * x**b

X = s_lt[4:]  # independent variable (Pandas Series)
Y = s_lm[4:]  # dependent variable (Pandas Series)

popt, pcov = curve_fit(powerlaw, X, Y)
residuals = Y - powerlaw(X, *popt)
ss_res = np.sum(residuals**2)         # residual sum of squares
ss_tot = np.sum((Y - np.mean(Y))**2)  # total sum of squares
r_squared = 1 - (ss_res / ss_tot)     # R-squared value
print("R-squared of power-law fit =", r_squared)

I got an R-squared value of -0.057....

From my understanding, it's not good to use R-squared values for non-linear functions, but I expected to get a much higher R-squared value than a linear model would, if anything because the extra flexibility risks overfitting. Did something else go wrong?

Kevin Trinh
1 Answer

See "The R-squared and nonlinear regression: a difficult marriage?" and also "When is R squared negative?".

Basically, we have two problems:

  1. nonlinear models do not have an intercept term, at least not in the usual sense;
  2. the equality SS_tot = SS_reg + SS_res may not hold (in linear regression with an intercept it follows from the residuals being orthogonal to the fitted values, which is not guaranteed for a nonlinear fit); see the sketch after this list.
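
Here is a minimal, self-contained sketch of point 2 (synthetic data and values, purely for illustration, not your data) showing that the decomposition need not hold for a nonlinear least-squares fit:

import numpy as np
from scipy.optimize import curve_fit

def powerlaw(x, a, b):
    return a * x**b

# Synthetic example data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 * x**0.5 + rng.normal(scale=0.5, size=x.size)

popt, _ = curve_fit(powerlaw, x, y)
pred = powerlaw(x, *popt)

ss_tot = np.sum((y - np.mean(y))**2)   # total sum of squares about the mean
ss_reg = np.sum((pred - np.mean(y))**2)  # "explained" sum of squares
ss_res = np.sum((y - pred)**2)         # residual sum of squares

# For a linear model with an intercept these two values would be equal;
# for a nonlinear model the cross term need not vanish.
print(ss_tot, ss_reg + ss_res)

The two printed values generally differ; and whenever ss_res exceeds ss_tot (a fit worse than simply predicting the mean), 1 - ss_res/ss_tot goes negative.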

The first reference above calls your statistic a "pseudo-R-squared" (in the case of non-linear models) and notes that it can be lower than 0.

To understand further what's going on, you probably want to plot your data Y as a function of X, the values predicted by the power law as a function of X, and the residuals as a function of X.
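
For example (a sketch assuming matplotlib is available and that X, Y and popt come from the code in the question):

import matplotlib.pyplot as plt

pred = powerlaw(X, *popt)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.scatter(X, Y, label="data")
ax1.plot(X, pred, color="red", label="power-law fit")
ax1.set_ylabel("Y")
ax1.legend()
ax2.scatter(X, Y - pred)             # residuals vs X
ax2.axhline(0, color="gray", linestyle="--")
ax2.set_xlabel("X")
ax2.set_ylabel("residual")
plt.show()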

For non-linear models I have sometimes calculated the sum of squared deviations from zero, to examine how much of that is explained by the model. Something like this:

pred = powerlaw(X, *popt)
ss_total = np.sum(Y**2)           # deviation from zero, not from the mean
ss_resid = np.sum((Y - pred)**2)  # residual sum of squares
pseudo_r_squared = 1 - ss_resid/ss_total

Calculated this way, pseudo_r_squared can still be negative (if the model is really bad, worse than just guessing that the data are all zero), but when it is positive I interpret it as the fraction of the "variation from 0" explained by the model.

TMBailey