
I am trying to figure out how `scipy.optimize.minimize` works in machine learning. So far I understand that you pass it a loss function, so that it can find the parameter values that give you the lowest loss. But as far as I understand, it first has to find the gradient/Hessian of the loss function before it can find the minimum. How does it do that? How does it know how to take the derivatives of a Python function, which contains other function calls and algorithms inside it? How can it reduce that to a mathematical function that it can take the derivative of? In my case, the loss function that I pass into minimize does the following: first it solves an epidemiological model (SEIRD) that is meant to predict confirmed cases and deaths of COVID-19. Then it compares that model's results to the actual data and computes the MSLE, which it returns. The code looks like this:

import numpy as np
from scipy.integrate import solve_ivp
from sklearn.metrics import mean_squared_log_error

def model_decay(params, data, population, return_solution=False, full_validation=True, forecast_days=0):
    R_0, theta, k, L, R_t, D_rec, D_inc = params  # R_0, fatality rate, k form, L shape, R end
    N = population  # Population of each country

    n_infected = data['ConfirmedCases'].iloc[0]
    n_dead = data['Fatalities'].iloc[0]
    max_days = len(data) + forecast_days
    rho = 1.0 / D_rec
    alpha = 1.0 / D_inc
    beta = R_0 * rho  # (immediately replaced by the time-dependent beta(t) below)

    y0 = N - n_infected, 0, n_infected, 0, 0

    def beta(t):
        return ((R_0 - R_t) / (1 + (t / L)**k) + R_t) * rho

    t = np.arange(max_days)

    # Solve the SEIRD differential equations (seir_d is defined elsewhere).
    sol = solve_ivp(seir_d, [0, max_days], y0, args=(N, beta, rho, alpha, theta), t_eval=t)
    sus, exp, inf, rec, dead = sol.y

    # model_decay.counter += 1
    # print(model_decay.counter)

    # Predict confirmed cases
    y_pred_cases = np.clip(inf + rec + dead, 0, np.inf)
    y_true_cases = data['ConfirmedCases'].values

    # Predict fatalities
    y_pred_dead = np.clip(dead, 0, np.inf)
    y_true_dead = data['Fatalities'].values

    # Thanks to anjum-48 for the following lines of code in this function.

    optim_days = min(20, len(data))  # Optimise over at most the 20 most recent days

    weights = 1 / np.arange(1, optim_days + 1)[::-1]  # Recent data is more heavily weighted
    # Use mean squared log error to evaluate
    msle_conf = mean_squared_log_error(y_true_cases[-optim_days:], y_pred_cases[-optim_days:],
                                       sample_weight=weights)
    msle_dead = mean_squared_log_error(y_true_dead[-optim_days:], y_pred_dead[-optim_days:],
                                       sample_weight=weights)
    if full_validation:
        msle = np.mean([msle_conf, msle_dead])
    else:
        msle = msle_conf

    if return_solution:
        return msle, sol
    else:
        return msle

And the minimize call looks like this:

from scipy.optimize import minimize

model_decay_res = minimize(model_decay, [1.8, 0.04, 2, 20, 1, 15, 4.9],
                           bounds=((0, 6.49), (0.01, 0.15), (0, 5), (0, 200), (0, 6.49), (3.48, 18), (4.9, 5.9)),
                           args=(train_data, population, False, full_validation),
                           method='L-BFGS-B')

It gives me a very low MSLE in the end, so it works pretty well; I just don't understand how exactly.

Magnus
  • See [scipy-doc](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.approx_fprime.html#scipy.optimize.approx_fprime) and [wiki](https://en.wikipedia.org/wiki/Numerical_differentiation). In your case with `'L-BFGS-B'`, I think the numerical differentiation is outsourced to the Fortran-based external code of the solver (instead of using SciPy's code). Of course there are some rules to obey: most solvers in scipy.optimize assume that your objective is twice differentiable, and something like `np.clip(inf + rec + dead, 0, np.inf)` might be scary then, so your positive result is partly "luck" – sascha Jun 18 '20 at 14:49
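To see what that kind of numerical differentiation looks like in practice, here is a small sketch (illustrative only, not the solver's internal code) that estimates a gradient with `scipy.optimize.approx_fprime` on a toy function:

```python
import numpy as np
from scipy.optimize import approx_fprime

def toy_loss(x):
    # A simple smooth function standing in for a real loss
    return x[0]**2 + 3 * x[1]**2

x0 = np.array([1.0, 2.0])
grad = approx_fprime(x0, toy_loss, 1.49e-8)  # forward differences with step 1.49e-8
print(grad)  # roughly [2.0, 12.0], matching the analytic gradient [2*x0, 6*x1]
```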

1 Answer


But as far as I understand, it first has to find the gradient/Hessian of the loss function before it can find the minimum. How does it do that?

It depends on what arguments you give it and which version of SciPy you have in your Python environment. Last I checked there are a few supported ways of supplying a gradient. The docs for the 1.4.1 version (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html) show:

jac{callable, ‘2-point’, ‘3-point’, ‘cs’, bool}, optional

and describe which minimizers support which option. If you don't supply a gradient yourself, a finite-difference scheme ('2-point') is used by default for many of the methods.
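As a small sketch of how the string forms look in a call (assuming a SciPy version in which 'L-BFGS-B' accepts them, and a hypothetical toy objective):

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    return (x[0] - 1.0)**2 + (x[1] + 2.0)**2

x0 = np.zeros(2)

# Gradient estimated with forward differences (the usual default when jac is omitted)
res_2p = minimize(objective, x0, method='L-BFGS-B', jac='2-point')

# Central differences: more function evaluations, but a more accurate gradient estimate
res_3p = minimize(objective, x0, method='L-BFGS-B', jac='3-point')

print(res_2p.x, res_3p.x)  # both close to [1, -2]
```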

A callable means that you provide a function or an object that implements the __call__ method (standalone functions have this by default; other objects only if you define it). You'd usually use this if you know the Jacobian or gradient because you've calculated it analytically and implemented it in Python (or in C/C++ via a LowLevelCallable: https://docs.scipy.org/doc/scipy-1.4.0/reference/generated/scipy.LowLevelCallable.html?highlight=lowlevelcallable#scipy.LowLevelCallable ).
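For example, a hypothetical objective with a hand-coded analytic gradient passed as a callable might look like this:

```python
import numpy as np
from scipy.optimize import minimize

def loss(x):
    return np.sum(x**2)

def loss_grad(x):
    # Analytic gradient of sum(x_i^2)
    return 2 * x

res = minimize(loss, x0=np.array([3.0, -4.0]), jac=loss_grad, method='L-BFGS-B')
print(res.x)  # close to [0, 0]
```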

The cs option is very cool and one I learned about when I first started using SciPy. Basically, your objective function is evaluated with complex arguments that carry a very small imaginary step; taking the imaginary part of the result and dividing by the step gives an excellent estimate of the gradient (better than finite differences), as long as every operation inside the objective works with complex numbers.
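A minimal sketch of the complex-step idea on a scalar function (illustrative, not SciPy's implementation):

```python
import numpy as np

def f(x):
    return np.exp(x) * np.sin(x)

h = 1e-20  # the step can be tiny: there is no subtractive cancellation
x = 1.3
deriv_cs = np.imag(f(x + 1j * h)) / h            # complex-step estimate
deriv_exact = np.exp(x) * (np.sin(x) + np.cos(x))  # analytic derivative
print(deriv_cs, deriv_exact)                     # agree to machine precision
```

Operations that are not complex-analytic, such as np.clip or abs, break this trick, which ties back to sascha's comment above.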

The boolean form means your objective function itself supplies the gradient: with jac=True, fun is expected to return a tuple (f, g) containing the objective value and the gradient together. Not all of the optimizer routines make use of this.
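A sketch of that pattern (hypothetical toy objective):

```python
import numpy as np
from scipy.optimize import minimize

def loss_and_grad(x):
    # Returns the loss together with its analytic gradient
    f = np.sum(x**2)
    g = 2 * x
    return f, g

res = minimize(loss_and_grad, x0=np.array([3.0, -4.0]), jac=True, method='L-BFGS-B')
print(res.x)  # close to [0, 0]
```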

One thing I should clarify is that the minimize routine just forwards the problem to the relevant optimizer method (e.g. 'trust-constr', 'slsqp', etc.), where the optimization is actually handled. The same applies in particular to Hessians (the hess argument), which only some of the methods support.

Hope that helps you understand what is going on "under the hood" of the minimize routine.

How does it know how to take the derivatives of a Python function, which contains other function calls and algorithms inside it?

In the case you've provided, I don't think anything special happens: the finite-difference scheme just calls your function at slightly perturbed parameter vectors and watches how the returned loss changes. It never looks inside your Python code or tries to differentiate it symbolically.
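A rough sketch of that idea, using a hypothetical toy loss (SciPy's actual implementation chooses the step sizes more carefully):

```python
import numpy as np

def forward_difference_grad(fun, x, eps=1e-8):
    """Rough '2-point' gradient estimate: perturb one parameter at a
    time and measure how the objective value changes."""
    x = np.asarray(x, dtype=float)
    f0 = fun(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += eps
        grad[i] = (fun(x_step) - f0) / eps
    return grad

# Toy quadratic; the analytic gradient is [2*x0, 6*x1].
print(forward_difference_grad(lambda x: x[0]**2 + 3 * x[1]**2, [1.0, 2.0]))
```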

Also, as a side note (not either of your two questions): if you want to track the progress of the optimizer you can pass options={'disp': True}, and that should show you some high-level tracking output as the optimizer progresses.
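For example (toy objective here, since your model_decay needs your data):

```python
import numpy as np
from scipy.optimize import minimize

res = minimize(lambda x: (x[0] - 1.0)**2 + (x[1] + 2.0)**2,
               x0=np.array([0.0, 0.0]),
               method='L-BFGS-B',
               options={'disp': True})  # prints convergence messages while it runs
```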

Lucas Roberts