
I am trying to minimize a cost function and I get very strange results from scipy.optimize.minimize (with the methods 'SLSQP' and 'L-BFGS-B').

I print the value of the cost function after each evaluation. First it performs the small perturbations before heading in what seems to be the right direction (fine so far). But then something strange happens: it behaves as if the objective had been replaced by something like (value of the cost function at the first evaluation) - (value of the cost function at the current evaluation), and it converges back towards the value of the first evaluation of the cost function.

To illustrate this I created a toy function of 2 parameters: 0.25 + 1000 * x1 ** 2 + 100 * x2 ** 2 + 0.1 * random(). x1 and x2 are restricted to the interval [0, 1] (bounds), and X0 is set to (0.5, 0.5). Here is what I get:

cost function: 275.3414617153509 x1: 0.5 x2: 0.5
cost function: 275.34428666473536 x1: 0.5000000149011612 x2: 0.5
cost function: 275.3542128554434 x1: 0.5 x2: 0.5000000149011612
cost function: 0.2665482586461191 x1: 0.0 x2: 0.0
cost function: 68.9989043756609 x1: 0.24986835289808013 x2: 0.24986835289808013
cost function: 154.87646326641064 x1: 0.374835397734792 x2: 0.374835397734792
cost function: 210.70119869030185 x1: 0.4373600232007103 x2: 0.4373600232007103
cost function: 241.8621094503892 x1: 0.4686490613793924 x2: 0.4686490613793924
cost function: 258.36597245010955 x1: 0.4843084999840323 x2: 0.4843084999840323
cost function: 266.6807722679986 x1: 0.4921461216177911 x2: 0.4921461216177911
cost function: 270.96794190195914 x1: 0.49606891372760337 x2: 0.49606891372760337
cost function: 273.0999396362265 x1: 0.49803236262951744 x2: 0.49803236262951744
cost function: 274.23903284113646 x1: 0.4990151079476797 x2: 0.4990151079476797
cost function: 274.7564047455383 x1: 0.4995070260788122 x2: 0.4995070260788122

 fun: 274.7564047455383
 jac: array([189579.1440506 , 855714.52631378])
 message: 'Optimization terminated successfully'
nfev: 14
 nit: 1
njev: 1
status: 0
success: True
x: array([0.49950703, 0.49950703])

So I do not understand:

  1. why the final result is 274.756... and not 0.2666
  2. why it starts to converge towards X0

What makes me think that the cost function is "modified" (i.e., that what it tries to minimize is not the cost function itself but the initial cost function evaluation minus the current cost function evaluation) is that sometimes, due to the random() part of the toy function, the first evaluation is higher than the perturbation evaluations, and it still converges towards X0.

I am using Python 3.9.6 and scipy 1.6.1

Edit:

Here is the full code:

from random import random
from scipy.optimize import minimize

def toto(X):
    val = 0.25 + 1000 * X[0] ** 2 + 100 * X[1] ** 2 + 0.1 * random()
    print("cost function:", val, 'x1:', X[0], 'x2:', X[1])
    return val

optimization = minimize(toto, [0.5, 0.5], method="SLSQP", bounds=[[0.0, 1.0], [0.0, 1.0]])
print(optimization)

Mathieu

1 Answer


Trying your code, I get basically the same results.

I can't say I have a full solution to your problem, but I can point out a few issues. One is that scipy.optimize.minimize defaults to using a very small step to compute numerical gradients (e.g. for L-BFGS-B, the default step size eps is 1e-8). To see why this is a problem, consider computing a numerical derivative at the optimal solution (0, 0). The deterministic part of the derivative will be roughly 0, but what about the stochastic part? It will be the difference of two random values divided by 1e-8. That difference has a triangular distribution on [-0.1, 0.1], so its magnitude is typically a few hundredths, and the resulting "derivative" ends up roughly on the order of 1e6. So while the random noise barely changes the function value, it has a substantial effect on the numerical derivative.
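For concreteness, here is a minimal sketch of that finite-difference calculation using the toy function from the question (the printed value changes every run because of the noise):

from random import random

def toto(X):
    # the noisy toy function from the question
    return 0.25 + 1000 * X[0] ** 2 + 100 * X[1] ** 2 + 0.1 * random()

eps = 1e-8  # default finite-difference step used by L-BFGS-B

# forward difference along x1 at the true optimum (0, 0): the deterministic
# change is only 1000 * eps**2 ~ 1e-13, but the difference between the two
# noise draws is typically a few hundredths, so the quotient blows up
grad_x1 = (toto([eps, 0.0]) - toto([0.0, 0.0])) / eps
print(grad_x1)  # usually on the order of 1e6 in magnitude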

But if the gradients are so large, why does it say it converged? Both of the methods you listed also have an ftol convergence criterion, which declares convergence when the relative change in the function value between steps falls below a threshold. SLSQP doesn't provide any description in its convergence message, but L-BFGS-B at least gives a brief description of why it stopped. For the cases where it ended far away from (0, 0), convergence was triggered by this ftol criterion. I don't think there is anything in the code specifically drawing it back towards your initial point; rather, a random step away from it just doesn't lead to much of a change in the function value. If I run the code repeatedly, it converges to many different solutions and does not always come back near the initial point.
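For example, you can inspect the returned message and tighten ftol through the options dict (these are standard scipy options, but the exact wording of the message differs between methods and scipy versions):

from random import random
from scipy.optimize import minimize

def toto(X):
    return 0.25 + 1000 * X[0] ** 2 + 100 * X[1] ** 2 + 0.1 * random()

res = minimize(toto, [0.5, 0.5], method="L-BFGS-B",
               bounds=[(0.0, 1.0), (0.0, 1.0)],
               options={"ftol": 1e-12})  # stricter relative-change test
print(res.message, res.nit, res.nfev)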

You can't entirely fix this with a numerical-gradient-based optimizer alone, but you can at least improve your result by changing the value of eps. I found that if I changed eps to 1e-4, it tended to converge to (0, 0) or close to it. Increasing eps doesn't entirely fix the problem, though, as the gradient can still be significantly altered by the random portion.
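Concretely, that just means passing a larger eps in the options (a sketch; because of the noise the result still varies from run to run):

from random import random
from scipy.optimize import minimize

def toto(X):
    return 0.25 + 1000 * X[0] ** 2 + 100 * X[1] ** 2 + 0.1 * random()

# a larger step averages over a bigger deterministic change, so the noise
# contributes relatively less to each finite-difference gradient estimate
res = minimize(toto, [0.5, 0.5], method="L-BFGS-B",
               bounds=[(0.0, 1.0), (0.0, 1.0)],
               options={"eps": 1e-4})
print(res.x, res.fun)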

Other options are discussed in this prior post; they include denoising your function before evaluating gradients, or evaluating the function across the range, fitting it with splines, and then optimizing the fitted function.
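One simple denoising approach (my own sketch, not taken from the linked post; the helper averaged and the sample count n are illustrative names) is to average several evaluations per point before handing the function to the optimizer, which shrinks the noise by roughly 1/sqrt(n) at the cost of n times more evaluations:

from random import random
from scipy.optimize import minimize

def toto(X):
    return 0.25 + 1000 * X[0] ** 2 + 100 * X[1] ** 2 + 0.1 * random()

def averaged(X, n=50):
    # mean of n noisy evaluations; the noise standard deviation drops ~ 1/sqrt(n)
    return sum(toto(X) for _ in range(n)) / n

res = minimize(averaged, [0.5, 0.5], method="L-BFGS-B",
               bounds=[(0.0, 1.0), (0.0, 1.0)],
               options={"eps": 1e-4})
print(res.x, res.fun)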

If your interest is in diagnosing the problems with this specific code, someone who knows more about the technical details of the scipy implementation can probably help. However, if your interest is generally in finding minima of noisy functions, I think this example makes it clear that a numerical-gradient-based optimizer won't be sufficient for this purpose.

Tyberius
  • Thank you for the answer. I will try that and have a look at the other options. However, this does not explain why scipy.optimize.minimize returns the wrong optimal value (close to the initial point) instead of [0, 0] in the example of the post. – Mathieu Sep 10 '21 at 13:02
  • The solutions you mentioned did not help to solve my problem, as it does not seem to be a problem of convergence. In the example I posted, the algorithm got a better solution (4th call to the function to be optimized) than the one it returned as the result of the optimization (last call to the function to be optimized). – Mathieu Sep 16 '21 at 13:14
  • @Mathieu That's absolutely a problem of convergence. The optimization continues initially because it thinks the derivatives and the relative change in the function are large due to the noise. It ends later when the relative change in the function between iterations is small enough. I suppose you could save the best solution so far to a global variable outside the optimization (see the sketch after these comments), but from running your code multiple times, there were runs where it never came close to [0,0], so this won't entirely solve your problem either. – Tyberius Sep 16 '21 at 14:24
  • @Mathieu And I mentioned that for the changes to `optimize.minimize`, they don't really fix the problem because doing numerical derivatives with noise simply doesn't work very well in general. The other question I link to provides ways of denoising your derivatives. – Tyberius Sep 16 '21 at 14:26
  • My toy problem is certainly not well chosen. However, the example clearly shows that there is a problem. scipy.optimize.minimize should return "the solution of the optimization", and in the example it clearly did not return the solution of the optimization, which would be [0.0, 0.0], obtained in the 4th call to the function. – Mathieu Sep 30 '21 at 20:57
  • As you said, I can create a function to manage to get it. However, this example, and other results, suggest that the problem could be more profound. In the example it returns the x corresponding to the last call (which is also the x closest to the initial point), but it also returns the x closest to the initial point in cases where this corresponds neither to the last call nor to the minimum value. Therefore, it returns neither the "solution of the optimization" nor the last call of the last iteration. Conclusion: it is not clear to me what the method returns. – Mathieu Sep 30 '21 at 20:58
  • @Mathieu It's returning what it thinks is the solution, though the randomness makes this solution generally very poor. When you print the cost function, you are also printing it for steps that are run only to compute a numerical derivative. For example, when I run your code, it often ends after three prints: the first is the initial value, and the next two are small steps away used to compute the gradient. While the gradient is huge, it finds that the relative difference in function value is small, so it claims the optimization succeeded. It doesn't keep the value from these derivative calcs. – Tyberius Sep 30 '21 at 22:05
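A minimal sketch of the "save the best solution so far" idea mentioned in the comments above (the wrapper tracked and the dict best are illustrative names, not part of scipy; keep in mind that the lowest noisy value seen is not necessarily the true minimizer):

from random import random
from scipy.optimize import minimize

def toto(X):
    return 0.25 + 1000 * X[0] ** 2 + 100 * X[1] ** 2 + 0.1 * random()

best = {"x": None, "f": float("inf")}  # best (noisy) value seen so far

def tracked(X):
    # record every evaluation, including the ones used for numerical gradients
    val = toto(X)
    if val < best["f"]:
        best["f"], best["x"] = val, list(X)
    return val

res = minimize(tracked, [0.5, 0.5], method="L-BFGS-B",
               bounds=[(0.0, 1.0), (0.0, 1.0)])
print("reported:", res.x, res.fun)
print("best seen:", best["x"], best["f"])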