how to improve the quality of a nonlinear fit with python GEKKO?

Question

I am working on a biochemical model: there is an enzyme that catalyzes twice a substrate. By naming:
* E = the enzyme
* S = the original substrate
* P = the intermediate product, which is in turn substrate
* F = the final product
I have this reactions schema:
S+E <-> SE -> E + P <-> EP -> E + F
Named A the first catalysis reaction and B the second one, I have a total of 6 kinetic coefficients that are:
* ka = formation of substrate+enzyme complex (S + E -> SE)
* kar = dissolution of that complex (SE -> S +E) (the inverse reaction)
* kcata = catalytic coefficient (SE -> S + P)
and the analogous kb, kbr, kcatb
I have also two experimental datasets, in which the time course of the three species S, P, and F have been recorded, but each species has been sampled at different times and with a different number of points (the average size of each sample is 12 points). The two sets correspond to two different initial Enzyme concentrations. Then I have two sets of bi-dimensional arrays like S_E1[t_i, concentration_t_i], P_E1[t_i, concentration_t_i], F_E1[t_i, concentration_t_i] (where the t_i are different for S, P and F), and the analogous S_E2, P_E2, F_E2. The time is acquired with the accuracy of 1 s, in a range 0-60,000 s; for instance, the P_E1 first element looks like (t_i= 43280, conc.= 21.837), but the measurements are sparse in that range.
I would like to dynamically fit the differential equations system to obtain the values of the 6 coefficients (the various ks), but I have met several problems:
1. if I set m.time=np.linspace(0,60000,1), the program always crashes with a “memory fault”, independently of the solver I can choose, even though the Obj function computes the squared errors minimization only on a total of 72 points;
2. to bypass this problem, I have re-discretized the time in 100 s-intervals; so the experimental concentration values are reported as if they would have been acquired at the closest 100-integer s with respect to the real time: this can induce an error on the fit, but I hope this would be negligible; then I declare m.time= np.linspace(0,60000,101), and map all the arrays accordingly to the new timescale;
3. in this case the program works only when APOPT or IPOPT solver are used (BPOPT always returns an error of “singular matrix”); nevertheless, the resulting fit are not good (fitted points are far from experimental points) for three reasons:
a. the Obj function is really large at the end of the fit (> 10^3), thus accounting for the distance between experimental and fitted values;
b. the number of iterations remains below the maximum threshold, therefore the option to increase that threshold has obviously no effect;
c. the sensitivity to initial conditions is extremely high, therefore the resulting fit is not reliable.
I have tried to set some options to increase the maximum number of iterations or similar strategies, but nothing seems to work. Any suggestion is welcome!

# -------------------- importing packages
import numpy as np
import matplotlib.pyplot as plt
from gekko import GEKKO


# -------------------- declaring functions 

def rediscr(myarr, delta): #rediscretizzation function
    mydarr = np.floor((myarr // delta)).astype(int)
    mydarr = mydarr * delta
    return mydarr


def rmap(mytim, mydatx, mydaty, indarr, selarr, concarr): #function to map the concentration values on the re-discretized times
    j=0
    for i in range(len(mytim)):
        if(mytim[i]==mydatx[j]):
            indarr = np.append(indarr, i).astype(int);      
            selarr[i] = 1
            concarr[i] = mydaty[j]
            j += 1
            if(j == len(mydatx)):
                break;
    return indarr

# -------------------- input data, plotting & rediscr.

SE1 = np.genfromtxt("s_e1.txt")
PE1 = np.genfromtxt("q_e1.txt")
FE1 = np.genfromtxt("p_e1.txt")

# dataset 2
SE2 = np.genfromtxt("s_e2.txt")
PE2 = np.genfromtxt("q_e2.txt")
FE2 = np.genfromtxt("p_e2.txt")

plt.plot(SE1[:,0],SE1[:,1],'ro', label="s_e1")
plt.plot(PE1[:,0],PE1[:,1],'bo', label="p_e1")
plt.plot(FE1[:,0],FE1[:,1],'go', label="f_e1")

# plt.plot(SE2[:,0],SE2[:,1],'ro', label="s_e2")
# plt.plot(PE2[:,0],PE2[:,1],'bo', label="p_e2")
# plt.plot(FE2[:,0],FE2[:,1],'go', label="f_e2")


step= 100  # rediscretization factor
nout= "2set6par100p" # prefix for the filename of output files

nST = rediscr(SE1[:,0], step)
nPT = rediscr(PE1[:,0], step)
nFT = rediscr(FE1[:,0], step) 

nST2 = rediscr(SE2[:,0], step)
nPT2 = rediscr(PE2[:,0], step)
nFT2 = rediscr(FE2[:,0], step) 



# start modeling with gekko
m = GEKKO(remote=False)

timestep= (60000 // step) +1
m.time = np.linspace(0,60000,timestep)

# definig indXX= array index of the positions corresponding to measured concentratio values; select_XX= boolean array =0 if there is no measure, =1 if the measure exists; conc_X= concentration value at the selected time
indST=np.array([]).astype(int)
indPT=np.array([]).astype(int)
indFT=np.array([]).astype(int)
select_s=np.zeros(len(m.time), dtype=int)
select_f=np.zeros(len(m.time), dtype=int)
select_p=np.zeros(len(m.time), dtype=int)
conc_s=np.zeros(len(m.time), dtype=float)
conc_f=np.zeros(len(m.time), dtype=float)
conc_p=np.zeros(len(m.time), dtype=float)

indST2=np.array([]).astype(int)
indFT2=np.array([]).astype(int)
indPT2=np.array([]).astype(int)
select_s2=np.zeros(len(m.time), dtype=int)
select_f2=np.zeros(len(m.time), dtype=int)
select_p2=np.zeros(len(m.time), dtype=int)
conc_s2=np.zeros(len(m.time), dtype=float)
conc_f2=np.zeros(len(m.time), dtype=float)
conc_p2=np.zeros(len(m.time), dtype=float)

indST= rmap(m.time, nST, SE1[:,1], indST, select_s, conc_s)
indPT= rmap(m.time, nPT, PE1[:,1], indPT, select_p, conc_p)
indFT= rmap(m.time, nFT, FE1[:,1], indFT, select_f, conc_f)

indST2= rmap(m.time, nST2, SE2[:,1], indST2, select_s2, conc_s2)
indPT2= rmap(m.time, nPT2, PE2[:,1], indPT2, select_p2, conc_p2)
indFT2= rmap(m.time, nFT2, FE2[:,1], indFT2, select_f2, conc_f2)


kenz1 = 0.000341; # value of a characteristic global constant of the first reaction (esperimentally determined)
kenz2 = 0.0000196; # value of a characteristic global constant of the first reaction (esperimentally determined)

ka = m.FV(value=0.001, lb=0); ka.STATUS = 1     #   parameter to change in fitting the curves
kar = m.FV(value=0.000018, lb=0); kar.STATUS = 1        # parameter to change in fitting the curves
kb = m.FV(value=0.000018, lb=0); kb.STATUS = 1         # parameter to change in fitting the curves
kbr = m.FV(value=0.00000005, lb=0); kbr.STATUS = 1        #  parameter to change in fitting the curves
kcata = m.FV(value=0.01, lb=0); kcata.STATUS = 1        #  parameter to change in fitting the curves
kcatb = m.FV(value=0.01, lb=0);  kcatb.STATUS = 1        #  parameter to change in fitting the curves



SC1 = m.Var(SE1[0,1], lb=0, ub=SE1[0,1]) # fit to measurement
FC1 = m.Var(0, lb=0, ub=SE1[0,1]) # fit to measurement
PC1 = m.Var(0, lb=0, ub=SE1[0,1])    # fit to measurement
E1 =m.Var(1, lb=0, ub=1) # for enzyme and enzymatic complexes, I have no experimental data
ES1=m.Var(0, lb=0, ub=1) # for enzyme and enzymatic complexes, I have no experimental data
EP1=m.Var(0, lb=0, ub=1) # for enzyme and enzymatic complexes, I have no experimental data
E2 =m.Var(2, lb=0, ub=2) # for enzyme and enzymatic complexes, I have no experimental data
ES2=m.Var(0, lb=0, ub=2) # for enzyme and enzymatic complexes, I have no experimental data
EP2=m.Var(0, lb=0, ub=2) # for enzyme and enzymatic complexes, I have no experimental data
SC2 = m.Var(SE2[0,1], lb=0, ub=SE2[0,1]) # fit to measurement
FC2 = m.Var(0, lb=0, ub=SE2[0,1]) # fit to measurement
PC2 = m.Var(0, lb=0, ub=SE2[0,1])    # fit to measurement

sels = m.Param(select_s) # boolean point in time for s species
selp = m.Param(select_p) # ""                        p
self = m.Param(select_f) # ""                        f 
c_s = m.Param(conc_s) # concentration values
c_p = m.Param(conc_p) # concentration values
c_f = m.Param(conc_f) # concentration values

sels2 = m.Param(select_s2) # boolean point in time for s species
selp2 = m.Param(select_p2) # ""                        p
self2 = m.Param(select_f2) # ""                        f 
c_s2 = m.Param(conc_s2) # concentration values
c_p2 = m.Param(conc_p2) # concentration values
c_f2 = m.Param(conc_f2) # concentration values

m.Equations([E1.dt() ==-ka * SC1 * E1 +(kar + kcata) * ES1 - kb * E1 * PC1 + (kbr + kcata) * EP1, \
PC1.dt() == kcata * ES1 - kb * E1 * PC1 +kbr * EP1, \
ES1.dt() == ka * E1 * SC1 - (kar + kcata) * ES1, \
SC1.dt() == -ka * SC1 * E1 + kar * ES1,\
EP1.dt() == kb * E1 * PC1 - (kbr + kcata) * EP1, \
FC1.dt() == kcata * EP1, \
E2.dt() == -ka * SC2 * E2 +(kar + kcatb) * ES2 - kb * E2 * PC2 + (kbr + kcatb) * EP2, \
PC2.dt() == kcatb * ES2 - kb * E2 * PC1 +kbr * EP2, \
ES2.dt() == ka * E2 * SC2 - (kar + kcatb) * ES2, \
SC2.dt() == -ka * SC2 * E2 + kar * ES2,\
EP2.dt() == kb * E2 * PC2 - (kbr + kcatb) * EP2, \
FC2.dt() == kcatb * EP2 ])

m.Minimize((sels*(SC1-c_s))**2 + (self*(FC1-c_f))**2 + (selp*(PC1-c_p))**2 + (sels2*(SC2-c_s2))**2 + (self2*(FC2-c_f2))**2 + (selp2*(PC2-c_p2))**2)

m.options.IMODE = 5   # dynamic estimation
m.options.SOLVER = 1
m.solve(disp=True, debug=False)    # display solver output
ai= m.options.APPINFO

if(ai):
    print("Impossibile to solve!\n")
else: # output datafiles and graphs
    fk_enz_a = kcata.value[0] /((kar.value[0] + kcata.value[0])/ka.value[0])
    fk_enz_b = kcatb.value[0] /((kbr.value[0] + kcatb.value[0])/kb.value[0])
    frac_kenza = fk_enz_a/kenz1
    frac_kenzb = fk_enz_b/kenz2
    print("Solver APOPT - ka= ", ka.value[0], "kb= ",kb.value[0], "kar= ", kar.value[0], "kbr= ", kbr.value[0], "kcata= ", kcata.value[0], "kcata= ", kcatb.value[0], "kenz_a= ", fk_enz_a, "frac_kenz_a=", frac_kenza, "kenz_b= ", fk_enz_b, "frac_kenz_b=", frac_kenzb)     

    solv="_a_";
    tis=m.time[indST]
    fcs=np.array(SC1)
    pfcs= fcs[indST]
    tif=m.time[indFT]
    fcf=np.array(FC1)
    pfcf=fcf[indFT]
    tip=m.time[indPT]
    fcp=np.array(PC1)
    pfcp=fcp[indPT]
    fce=np.array(E1)
    fces=np.array(ES1)
    fcep=np.array(EP1)

    np.savetxt(nout+solv+"_fit1.txt", np.c_[m.time, fcs, fcp, fcf, fce, fces, fcep], fmt='%f', delimiter='\t')
    np.savetxt(nout+solv+"_s1.txt", np.c_[tis, pfcs], fmt='%f', delimiter='\t')
    np.savetxt(nout+solv+"_p1.txt", np.c_[tip, pfcp], fmt='%f', delimiter='\t')
    np.savetxt(nout+solv+"_f1.txt", np.c_[tif, pfcf], fmt='%f', delimiter='\t')


    tis2=m.time[indST2]
    fcs2=np.array(SC2)
    pfcs2= fcs2[indST2]
    tif2=m.time[indFT2]
    fcf2=np.array(FC2)
    pfcf2=fcf2[indFT2]
    tip2=m.time[indPT2]
    fcp2=np.array(PC2)
    pfcp2=fcp2[indPT2]
    fce2=np.array(E2)
    fces2=np.array(ES2)
    fcep2=np.array(EP2)

    np.savetxt(nout+solv+"_fit2.txt", np.c_[m.time, fcs2, fcp2, fcf2, fce2, fces2, fcep2], fmt='%f', delimiter='\t')
    np.savetxt(nout+solv+"_s2.txt", np.c_[tis2, pfcs2], fmt='%f', delimiter='\t')
    np.savetxt(nout+solv+"_p2.txt", np.c_[tip2, pfcp2], fmt='%f', delimiter='\t')
    np.savetxt(nout+solv+"_f2.txt", np.c_[tif2, pfcf2], fmt='%f', delimiter='\t')


    plt.plot(tis, pfcs,'gx', label="Fs_e1")
    plt.plot(tip, pfcp,'bx', label="Fp_e1")
    plt.plot(tif, pfcf,'rx', label="Ff_e1")

    plt.plot(tis2, pfcs2,'gx', label="Fs_e2")
    plt.plot(tip2, pfcp2,'bx', label="Fp_e2")
    plt.plot(tif2, pfcf2,'rx', label="Ff_e2")



    plt.axis([0, 60000, 0, 60])
    plt.legend()
    plt.savefig(nout+solv+"fit.png")

    plt.close()

score 0 · Answer 1 · answered Jun 19 '20 at 04:46

0

There is no s_e1.txt or other data files so I'll give a sample problem that illustrates some of the methods that you can use. However, let me give you some insights on your questions:

The error with m.time=np.linspace(0,60000,1) is that there is only 1 time point and this produces the array array([0.]). You need at least 2 time points for dynamic problems such as np.linspace(0,60000,2) to give array([ 0., 60000.]).
If you have too many time points such as np.linspace(0,1,60000) then the application can run out of memory because the problem is too large (>4 GB) if you are using the local 32-bit Windows application with remote=False. This shouldn't be a problem for the Linux or MacOS versions that are compiled as 64-bit executables.
You can include the exact time points where your measurements occurred. There is no need to put in approximate time points. You can set m.time = [0,0.1,0.5,0.9,...,50000,60000].
Set up the objective to skip certain time points if they are missing. The minimal example below shows how to skip measurements when p1 or p2 are zero. The slopes a and b are estimated. The values of -5 in m1 and -6 in m2 are ignored.

from gekko import GEKKO
m = GEKKO()
m.time = [0,1,2,3,4.5,6]
a = m.FV(); a.STATUS = 1
b = m.FV(); b.STATUS = 1
p1 = m.Param([0,0,1,0,0,1]) # indicate where there are measurements
p2 = m.Param([1,1,0,1,0,1])
m1 = m.Param([3,-5,2.5,-5,-5,1.0]) # measurements
m2 = m.Param([0,1,-6,2.5,-6,1.7])
v1 = m.Var(m1) # initialize with measurements
v2 = m.Var(m2)
# add equations
m.Equations([v1.dt()==a, v2.dt()==b])
# add objective function
m.Minimize(p1*(m1-v1)**2)
m.Minimize(p2*(m2-v2)**2)
m.options.IMODE = 6
m.solve()

import matplotlib.pyplot as plt
plt.figure(figsize=(12,5))
plt.plot(m.time,v1,'r--',label='v1')
plt.plot(m.time,v2,'b:',label='v2')
plt.plot(m.time,m1,'ro',label='m1')
plt.plot(m.time,m2,'bx',label='m2')
plt.savefig('demo.png'); plt.show()

answered Jun 19 '20 at 04:46

John Hedengren

12,068
1
21
25

thank you very much for your suggestion. I have imported the original time measurements (rounding them to integers) and storing them in m.time; at the same time I have generated the boolean arrays with the existant points and those with the concetration measurements. I have checked the data content of each array, finding that the content is right and placed in the right position. – robyrobur Jun 20 '20 at 16:39
Unfortunately, the problems remain. The APOPT stops after 13 steps declaring it has reached the convergence (convergence value= 10^(-11)) but the Obj function value is still very high (> 10^4) and in fact the fit is not good and the values calculated for k coefficients are far from being biologically reliable. Thank you very much for your support and kindness! Roberto – robyrobur Jun 20 '20 at 16:39
You probably have multiple local minima for your problem and either need a global solver or else a multi-start method that can try different initial guesses until it achieves the optimal solution. The other thing to try is to use literature values or first solve a simpler problem to give you more insight on the parameter values. – John Hedengren Jun 21 '20 at 02:33
Thank you again! surely, this is the scenario: because of this situation I have tried both APOPT and IPOPT (that should be a global solver, if I am not wrong). Unfortunately they behave the same way. Is there some option to set in the m.solver.options that enables a "global solver" option or forces the solver to stop computations only when the Obj function is below a defined threshold? Thank you again for your support! – robyrobur Jun 21 '20 at 15:47
IPOPT is also a local minimizer. Unfortunately, there is no global optimizer in Gekko yet. You may want to look at BARON or Couenne. Otherwise, better initial guesses for the parameters could help. If you do have better initial guesses then I recommend that you first perform a simulation and then the parameter regression. The simulation will help the optimizer because it starts with a feasible solution that uses your initial guess values. – John Hedengren Jun 21 '20 at 16:38
1

Unfortunately, I have no values from literature for this system. Nevertheless, I have some further constraints that I could impose on the system. For instance, the ratio (kcata+kar)/ka should be equal to k_enza (an experimentally measured values); analogously: (kcatb+kbr)/kb = k_enzb. Can I add these coinstraints to the system? I have tried to add them as m.Equation, but I get errors. – robyrobur Jun 22 '20 at 14:30
Perhaps you could set an objective function to minimize the difference. Also, you could calculate `kcatb`, `kar`, and `ka` if you have 3 data points for `k_enza`. You could also perform a regression to come up with good guess values for `kcatb`, `kar`, and `ka` by minimizing the difference between the measured and predicted `k_enza`. Here is a tutorial (but with energy prices): https://apmonitor.com/me575/index.php/Main/NonlinearRegression This regression would be a separate application, just to find good guess values. It won't work to add those as equations to your model. – John Hedengren Jun 22 '20 at 18:13
1

The parameters k_enza and k_enzb have only one value, then the nonlinear regression approach is not feasible. I am currently trying to use Couenne. Thank you again for your suggestions! – robyrobur Jun 23 '20 at 16:11
Good idea to try Couenne. It is not a problem to have only one value for `k_enza` and `k_enzb`. You can declare them as `m.FV()` and set `k_enza.STATUS=1` to have it optimize that parameter. – John Hedengren Jun 23 '20 at 20:02

how to improve the quality of a nonlinear fit with python GEKKO?

1 Answers1