
I am attempting to fit a function to a set of data I have. The function in question is:

x(t) = - B + sqrt(AB(t-t0) + (x0 + B)^2)

I have tried to fit my data (included at the bottom) to this function using two different methods, but I have found that, whatever I do, the fitted value of B is extremely unstable. Changing either the method or the initial guess wildly changes the output value. In addition, when I look at the error for this fit reported by curve_fit, the error is almost two orders of magnitude higher than the value itself. Does anyone have suggestions on what I should do to decrease the error?

import numpy as np
import scipy.optimize as spo

def modelFun(t,A,B):
    # t0 and x0 are module-level globals taken from the first data point below
    return -B + np.sqrt(A*B*(t-t0) + np.power(x0 + B,2))

def errorFun(k,time,data):
    # sum of squared residuals, minimized with scipy.optimize.minimize
    A = k[0]
    B = k[1]
    return np.sum((data-modelFun(time,A,B))**2)

data = np.genfromtxt('testdata.csv',delimiter=',',skip_header = 1)
time = data[:,0]
xt = data[:,1]
t0 = data[0,0]
x0 = data[0,1]

minErrOut = spo.minimize(errorFun,[1,1000],args=(time,xt),bounds=((0,None),(0,None)))
(curveOut, curveCovar) = spo.curve_fit(modelFun,time,xt,p0=[1,1000],method='dogbox',bounds=([-np.inf,0],np.inf))

print('minimize result: A={}; B={}'.format(*minErrOut.x))
print('curveFit result: A={}; B={}'.format(*curveOut))
print('curveFit Error: A={}; B={}'.format(*np.sqrt(np.diag(curveCovar))))

Datafile:

Time,x
201,2.67662
204,3.28159
206,3.44378
208,3.72537
210,3.94826
212,4.36716
214,4.65373
216,5.26766
219,5.59502
221,6
223,6.22189
225,6.49652
227,6.799
229,7.30846
231,7.54229
232,7.76517
233,7.6209
234,7.89552
235,7.94826
236,8.17015
237,8.66965
238,8.66965
239,8.8398
240,8.88856
241,9.00697
242,9.45075
243,9.51642
244,9.63483
245,9.63483
246,10.07861
247,10.02687
248,10.24876
249,10.31443
250,10.47164
251,10.99502
252,10.92935
253,11.0995
254,11.28358
255,11.58209
256,11.53035
257,11.62388
258,11.93632
259,11.98806
260,12.26269
261,12.43284
262,12.60299
263,12.801
264,12.99502
265,13.08557
266,13.25572
267,13.32139
268,13.57114
269,13.76617
270,13.88358
271,13.83184
272,14.10647
273,14.27662
274,14.40796
1 Answer


TL;DR

Your dataset is essentially linear and lacks observations at larger timescales. As a result the fit can capture A, which is proportional to the slope, while the model has to push B very large (potentially unbounded) to suppress the square-root curvature.

This can be confirmed by expanding your model in a Taylor series and analyzing the MSE surface associated with the regression.

In a nutshell, for this kind of dataset and the given model: trust A, don't trust B.

MCVE

First, let's reproduce your problem:

import io
import numpy as np
import pandas as pd
from scipy import optimize

stream = io.StringIO("""Time,x
201,2.67662
204,3.28159
...
273,14.27662
274,14.40796""")
data = pd.read_csv(stream)

# Origin Shift:
data = data.sub(data.iloc[0,:])
data = data.set_index("Time")

# Simplified model:
def model(t, A, B):
    return -B + np.sqrt(A*B*t + np.power(B, 2))

# NLLS Fit:
parameters, covariance = optimize.curve_fit(model, data.index, data["x"].values, p0=(1, 1000), ftol=1e-8)
# array([3.23405915e-01, 1.59960168e+05])
# array([[ 3.65068730e-07, -3.93410484e+01],
#        [-3.93410484e+01,  9.77198860e+12]])

The fit is reasonably good:

[figure: fitted model overlaid on the shifted data]
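
For reference, here is a minimal plotting sketch (the original answer does not show its plotting code; this assumes the data, model and parameters objects defined above):

import matplotlib.pyplot as plt

# Assumed plotting code, not from the original answer: overlay the fitted
# model on the origin-shifted data.
t_plot = np.linspace(data.index.min(), data.index.max(), 200)
fig, ax = plt.subplots()
ax.scatter(data.index, data["x"], s=10, label="data")
ax.plot(t_plot, model(t_plot, *parameters), color="C1", label="fit")
ax.set_xlabel("t - t0")
ax.set_ylabel("x - x0")
ax.legend()
plt.show()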

But as you noticed, the fitted parameters differ by several orders of magnitude, which can prevent the optimization from performing properly.

Notice that your dataset is quite linear. The observed effect is not surprising and is inherent to the chosen model: the parameter B must be several orders of magnitude bigger than A to keep the behaviour linear.

This claim is supported by the first terms of the Taylor series of the model: expanding x(t) = -B + sqrt(ABt + B^2) = B(sqrt(1 + At/B) - 1) ≈ At/2 - A^2 t^2/(8B) for small At/B shows that the leading (linear) term depends only on A, while B only enters through the small curvature correction:

def taylor(t, A, B):
    return (A/2*t - A**2/B*t**2/8) 

parameters, covariance = optimize.curve_fit(taylor, data.index, data["x"].values, p0=(1, 100), ftol=1e-8)
parameters
# array([3.23396685e-01, 1.05237134e+09])
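
Since the leading Taylor term is A*t/2, A can also be estimated with a plain linear regression on the shifted data, without involving B at all (a quick sketch, assuming the data frame from the MCVE above):

# Leading-order estimate of A from the slope of the (shifted) data.
slope, intercept = np.polyfit(data.index, data["x"].values, 1)
A_estimate = 2 * slope
print(A_estimate)  # should be close to the ~0.32 found by curve_fit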

Unsurprisingly, the slope of your linear dataset is captured, while the parameter B can become arbitrarily large and causes floating-point arithmetic errors during the optimization (hence the minimize warning you got, shown below).

Analyzing the Error Surface

The problem can be reformulated as a minimization problem:

def objective(beta, t, x):
    return np.sum(np.power(model(t, beta[0], beta[1]) - x, 2))

result = optimize.minimize(objective, (1, 100), args=(data.index, data["x"].values))

#      fun: 0.6594398116927569
# hess_inv: array([[8.03062155e-06, 2.94644208e-04],
#       [2.94644208e-04, 1.14979735e-02]])
#      jac: array([2.07304955e-04, 6.40749931e-07])
#  message: 'Desired error not necessarily achieved due to precision loss.'
#     nfev: 389
#      nit: 50
#     njev: 126
#   status: 2
#  success: False
#        x: array([3.24090627e-01, 2.11891188e+03])

If we plot the MSE associated with your dataset, we get the following surface:

[figure: MSE surface for the original dataset]

We see a canyon that is narrow along the A axis but appears unbounded, at least over the first few decades, along the B axis. This supports the observations in your post and comments, and it gives a technical insight into why B cannot be fitted properly.
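
A surface like this can be reproduced by evaluating the objective on a grid of parameters. A minimal sketch (the grid ranges and plotting choices are assumptions, not part of the original answer):

import matplotlib.pyplot as plt

# Assumed grid-evaluation code: compute the sum of squared errors over a
# log-spaced grid of (A, B) and draw contours of its logarithm.
A_grid = np.logspace(-2, 1, 100)
B_grid = np.logspace(0, 6, 100)
SSE = np.array([[objective((a, b), data.index, data["x"].values)
                 for a in A_grid] for b in B_grid])

fig, ax = plt.subplots()
cs = ax.contourf(A_grid, B_grid, np.log10(SSE), levels=30)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("A")
ax.set_ylabel("B")
fig.colorbar(cs, label="log10(SSE)")
plt.show()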

Performing the same operation on a synthetic dataset:

t = np.linspace(0, 1000, 100)
x = model(t, 0.35, 20)
data = pd.DataFrame(x, index=t, columns=["x"])

This synthetic dataset exhibits the square-root shape in addition to the initial linear trend.

result = optimize.minimize(objective, (1, 0), args=(data.index, data["x"].values), tol=1e-8)

#      fun: 1.9284246829733202e-10
# hess_inv: array([[ 4.34760333e-05, -4.42855253e-03],
#       [-4.42855253e-03,  4.59219063e-01]])
#      jac: array([ 4.35726463e-03, -2.19158602e-05])
#  message: 'Desired error not necessarily achieved due to precision loss.'
#     nfev: 402
#      nit: 94
#     njev: 130
#   status: 2
#  success: False
#        x: array([ 0.34999987, 20.000013  ])

This version of the problem has the following MSE surface:

[figures: MSE surface for the synthetic dataset]

It shows a convex valley around the known solution, which explains why both parameters can be fitted when the acquisition covers a sufficiently long time span.

Notice that the valley is strongly stretched, meaning that even in this scenario your problem would benefit from normalization.
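
One simple way to apply such a normalization, sketched below as an assumption rather than the original answer's code, is to rescale B inside the model so that both fitted parameters have comparable magnitudes:

# Assumed normalization sketch: fit b = B / 100 instead of B so that both
# parameters are of comparable order, which makes the stretched valley easier
# for the optimizer to traverse.
def scaled_model(t, A, b):
    B = 100.0 * b
    return -B + np.sqrt(A * B * t + B**2)

parameters, covariance = optimize.curve_fit(
    scaled_model, data.index, data["x"].values, p0=(1, 1), ftol=1e-8)
A_fit, B_fit = parameters[0], 100.0 * parameters[1]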

  • Hi thanks for the help! The ``x`` and ``B`` have units of length (m), ``t`` is time (seconds) and ``A`` is a velocity (length/time). This model is the result of a theoretical prediction and fits other data sets which have more time points. I suspect that this may be the fundamental problem, which is that given the non-linearity of the model there just is not sufficient resolution in the time data to accurately fit the model. I would expect ``A`` to be around the fitted value, on the order of 0.1-1. On the other hand ``B`` should be closer to 20. – JamesVR Aug 16 '22 at 18:59
  • Thanks for the precision. Another problem in this fit is the product of the parameters A*B, which may confuse the optimisation algorithm since modifying either one affects the main term. I will have a deeper look at it tomorrow. I'll keep you updated. – jlandercy Aug 16 '22 at 19:27
  • @JamesVR, I have completed my answer with evidence of why it does not and will not work for this dataset. Let me know if it addresses your original post. Cheers. – jlandercy Aug 17 '22 at 02:55
  • This makes a lot of sense and is far more in-depth than I was expecting. Thank you very much for your help! I'm glad that the error was in the quality of the dataset rather than a deep problem with the programming - far easier to go gather some more data than track down a nefarious bug. – JamesVR Aug 17 '22 at 20:42