0

Does anybody know if there is an implementation of the Stata nl (nonlinear least squares) package in Python? I tried to use lmfit as well as optimize.leastsq from scipy, but both do not seem to work.

The equation for the regression is

Y = x1 + b1 + 0.3*log(x2-b2)*b3 - 0.7*x3*b3 + b5*x2

where Y is the dependent variable, the x's are the independent variables, and the b's are the coefficients to be estimated.

Using the lmfit package, i tried the following:


from lmfit import minimize, Parameters, Parameter, report_fit
import pandas as pd
import numpy as np

inputfile = "testdata.csv"
df = pd.read_csv(inputfile)

x1= df['x1']
x2 = df['x2']
x3= df['x3']
y= df['y']


def fcn2min(params, x1, x2, x3, y):

    b1 = params['b1'].value
    b2 = params['b2'].value
    b3 = params['b3'].value
    b5 = params['b5'].value
    model = x1 + b1 + (0.3)*np.log(x2-b2)*b3 - (0.7)*x3*b3 + b5*x2
    return model - y

params = Parameters()

params.add('b1', value= 10)

params.add('b2', value= 1990)

params.add('b3', value= 5)

params.add('b5', value= 12)


result = minimize(fcn2min, params, args=(x1, x2, x3, y))

print report_fit(result) 

As a result, all parameters are estimated to be NaN. Can anybody explain what I did wrong? Or, is there a good implementation of Stata's nl function in Python? Many thanks!

Here's the data in the CSV file:


x1,x2,x3,y
1981,15.2824955,14.56475067,2.936807632
1982,15.2635746,15.52343941,2.908272743
1983,15.30461597,16.30871582,2.940227509
1984,15.37490845,16.76519966,3.001846313
1985,15.41295338,17.04235458,3.030970573
1986,15.44680405,17.25271797,3.055702209
1987,15.48135281,17.44781876,3.081344604
1988,15.52259159,17.62217331,3.113491058
1989,15.5565939,17.71343422,3.138068199
1990,15.57392025,17.81187439,3.144176483
1991,15.57197666,17.89474106,3.128887177
1992,15.60479259,17.98217583,3.14837265
1993,15.63134575,18.06685829,3.161927223
1994,15.67116165,18.16578865,3.18959713
1995,15.69621944,18.27449799,3.202876091
1996,15.7329874,18.38712311,3.228042603
1997,15.77698135,18.50685883,3.260077477
1998,15.81788635,18.63579178,3.289312363
1999,15.86141682,18.76427078,3.321393967
2000,15.89737129,18.89691544,3.34650898
2001,15.90485096,18.99729347,3.344522476
2002,15.92070866,19.06253433,3.351119995


  • I don't know the `lmfit` module, but in your `result = ...` line, what are the values of x1, x2, x3 and y? From the code you've posted, I'd think you'd be getting `NameError: name 'x1' is not defined`. – rmunn Oct 01 '15 at 12:28
  • x1, x2, x3 and y are data – user2986671 Oct 01 '15 at 13:28
  • But where are they assigned? The code snippet doesn't show how they're being assigned. It's possible that they're the source of the NaNs you're seeing, or it could be something else -- but you haven't shown us all the code we'll need to see in order to help you. – rmunn Oct 01 '15 at 13:29
  • I've never used `pandas` either, so I'm only guessing -- but your CSV file looks strange to me. Is `y` the year variable? If so, shouldn't the first line be `y,x1,x2,x3`? But then you have what looks like a year (1990) in `b2`, which doesn't match up with the fact that your years are in the first column... I don't know. Sorry I can't be of any help; hopefully others will see this and be able to help you out. – rmunn Oct 01 '15 at 14:59
  • y is the outcome (dependent) variable; x2 is the year (the year is a independent, explanatory, variable here. thank you very much though! – user2986671 Oct 01 '15 at 16:02
  • I find it difficult to imagine that anyone would want to write an exact clone of a Stata command in another language. Rather it is easy to imagine that someone would want to write their own non-linear least squares routine with their own choices of exactly what it does and what syntax it should have, consistently with rules and practice for that language. – Nick Cox Oct 01 '15 at 16:51
  • The variable naming still doesn't match up with the csv header line. If x2 is the year and the `value` of b2 is the starting value for the estimation, then `np.log(x2-b2)` is nan for years on or before 1990. Make sure you have valid start parameters for any nonlinear fitting method. – Josef Oct 01 '15 at 23:09
  • @Nick Cox statsmodels has many parts that are closely modeled after the corresponding Stata or R functions (although not exact clones). However, `lmfit` comes from the engineering tradition in scientific Python, even though it moves closer to statistics with each new version. – Josef Oct 01 '15 at 23:21
  • Thanks for your help guys. Choosing a earlier initial guess for b2 doesnt help either. I will use Stata for now :) – user2986671 Oct 08 '15 at 12:14

1 Answers1

1

Just to set the record straight, the reason for the failure here is because you are not checking for the case that x2-b2 might be negative, so that np.log(x2-b2) is NaN. Certainly, if the objective function returns NaN, the fit is going to stop and not be able to find a good solution. You could try adding an upper bound on b2. Like others, I suspect that if you are guessing b1 to be 10 and b2 to be 1990 that you have some simple error in your objective function that is causing NaN to occur. It's often good to call the objective function once, and maybe even plot the starting condition.

Or you can blame the tool.

M Newville
  • 7,486
  • 2
  • 16
  • 29