19

I'm trying to find the minimum of a function in N parameters using gradient descent. However, I want to do this while constraining the sum of the absolute values of the parameters to be 1 (or <= 1; it doesn't matter which). For this reason I am using the method of Lagrange multipliers: if my function is f(x), I will be minimizing f(x) + lambda * (g(x) - 1), where g(x) is a smooth approximation of the sum of the absolute values of the parameters.

Now, as I understand it, the gradient of this function will only be 0 when g(x) = 1, so a method that finds a local minimum should find a minimum of my function at which my condition is also satisfied. The problem is that this addition makes my function unbounded, so gradient descent simply finds larger and larger lambdas with larger and larger parameters (in absolute value) and never converges.

At the moment I'm using Python's (SciPy) implementation of CG, so I would really prefer suggestions that do not require me to rewrite/tweak the CG code myself but instead use an existing method.
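
For reference, here is a minimal sketch of the setup I described; the particular f, the smoothing of the absolute values, the dimension, and the starting point are just placeholders, not my real code:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # placeholder objective; my real f is different
    return np.sum((x - 0.5) ** 2)

def g(x, eps=1e-8):
    # smooth approximation of sum(|x_i|)
    return np.sum(np.sqrt(x ** 2 + eps))

def lagrangian(z):
    # z packs the N parameters plus lambda as the last entry
    x, lam = z[:-1], z[-1]
    return f(x) + lam * (g(x) - 1.0)

z0 = np.append(np.full(5, 0.2), 1.0)  # N = 5 parameters, lambda = 1
res = minimize(lagrangian, z0, method='CG')
# does not converge: the Lagrangian has no minimum in (x, lambda), only a saddle point
```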

nickb

3 Answers

34

The problem is that when using Lagrange multipliers, the critical points don't occur at local minima of the Lagrangian - they occur at saddle points instead. Since the gradient descent algorithm is designed to find local minima, it fails to converge when you give it a problem with constraints.

There are typically three solutions:

  • Use a numerical method which is capable of finding saddle points, e.g. Newton's method. These typically require analytical expressions for both the gradient and the Hessian, however.
  • Use penalty methods. Here you add an extra (smooth) term to your cost function, which is zero when the constraints are satisfied (or nearly satisfied) and very large when they are not satisfied. You can then run gradient descent as usual. However, this often has poor convergence properties, as it makes many small adjustments to ensure the parameters satisfy the constraints. (A minimal sketch appears just after this list.)
  • Instead of looking for critical points of the Lagrangian, minimize the square of the gradient of the Lagrangian. Obviously, if all derivatives of the Lagrangian are zero, then the square of the gradient will be zero, and since the square of something can never be less than zero, you will find the same solutions as you would by extremizing the Lagrangian. However, if you want to use gradient descent then you need an expression for the gradient of the square of the gradient of the Lagrangian, which might not be easy to come by.
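
A minimal sketch of the penalty approach in SciPy, adapted to the constraint g(x) = 1 from the question; the objective f, the smoothing, and the schedule of penalty weights are placeholder assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    return np.sum((x - 0.5) ** 2)          # placeholder objective

def g(x, eps=1e-8):
    return np.sum(np.sqrt(x ** 2 + eps))   # smooth approximation of sum(|x_i|)

def penalized(x, mu):
    # the penalty term is zero when g(x) = 1 and large when the constraint is violated
    return f(x) + mu * (g(x) - 1.0) ** 2

x = np.full(5, 0.2)
for mu in [1.0, 10.0, 100.0, 1000.0]:      # tighten the penalty gradually, warm-starting each solve
    x = minimize(penalized, x, args=(mu,), method='CG').x
print(x, g(x))                              # g(x) should approach 1
```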

Personally, I would go with the third approach, and find the gradient of the square of the gradient of the Lagrangian numerically if it's too difficult to get an analytic expression for it.
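
A minimal sketch of that third approach; f, g, and the starting point are placeholder assumptions, and the gradient of the Lagrangian is estimated by finite differences (so SciPy's CG ends up differencing an already-differenced quantity, which is noisy but usable for a feasibility test):

```python
import numpy as np
from scipy.optimize import minimize, approx_fprime

def f(x):
    return np.sum((x - 0.5) ** 2)          # placeholder objective

def g(x, eps=1e-8):
    return np.sum(np.sqrt(x ** 2 + eps))   # smooth approximation of sum(|x_i|)

def lagrangian(z):
    x, lam = z[:-1], z[-1]
    return f(x) + lam * (g(x) - 1.0)

def grad_norm_sq(z):
    # squared norm of the (numerically estimated) gradient of the Lagrangian;
    # this is zero exactly at critical points of the Lagrangian
    grad = approx_fprime(z, lagrangian, 1e-6)
    return np.dot(grad, grad)

z0 = np.append(np.full(5, 0.2), 1.0)       # parameters plus lambda
res = minimize(grad_norm_sq, z0, method='CG')
x_opt, lam_opt = res.x[:-1], res.x[-1]
```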

Also, you don't quite make it clear in your question - are you using gradient descent, or CG (conjugate gradients)?

Chris Taylor
  • I'm using conjugate gradients. Thanks for the detailed answer! – nickb Sep 05 '12 at 17:41
  • @chris-taylor Do you mean square of the gradient of the Lagrangian or the gradient of the square of the Lagrangian? What is the square of a gradient? – Sohail Si Oct 20 '15 at 18:55
  • @chris-taylor Can you introduce a reference/paper/textbook for your answer (especially the third solution). I am coding in JS which does not have libraries for constraint optimizers and need to try a simple gradient descent to test feasibility of an approach. – Sohail Si Oct 20 '15 at 19:09
  • @SohailSi - it looks like this book has useful information: http://www.mit.edu/~dimitrib/Constrained-Opt.pdf – Andrei Sura Oct 30 '17 at 22:21
  • The third approach will yield just one of potentially multiple critical points of the Lagrangian, not necessarily corresponding to the minimum of the function f. For instance, example 2 from https://en.wikipedia.org/wiki/Lagrange_multiplier has as many as 6 critical points. How do you find all those points with gradient descent? Is there any other way than just starting multiple times from random points? – godfryd Apr 06 '18 at 12:37
  • I found this to be a useful addendum to the third method mentioned above: https://en.wikipedia.org/wiki/Lagrange_multiplier#Example_4:_Numerical_optimization – Sérgio Agostinho Jun 14 '19 at 10:01

7

In their 1987 paper "Constrained Differential Optimization", Platt and Barr solve this problem in a way that's really nice and easy. They call their method the basic differential multiplier method (BDMM).

The method's claim is that for a Lagrangian

L(x, b) = f(x) + b g(x),

doing gradient descent on x while doing gradient ascent on b will eventually converge to a stationary point of L(x, b), which is a local minimum of f(x) under the constraint g(x) = 0. A penalty term can also be added to make convergence faster and more stable (which the authors call the modified differential multiplier method).

Generally, just flipping the sign of the gradient update for b will work.

I've tried it on several simple cases and it works, though I don't really understand why, even after reading the paper.
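
A minimal sketch of the BDMM update on a toy problem; the quadratic f, the linear constraint, the step size, and the iteration count are my own choices for illustration, not taken from the paper:

```python
import numpy as np

# toy problem: minimize f(x) = ||x - c||^2 subject to g(x) = sum(x) - 1 = 0
c = np.array([3.0, -1.0])

def grad_f(x):
    return 2.0 * (x - c)

def g(x):
    return np.sum(x) - 1.0

def grad_g(x):
    return np.ones_like(x)

x, b, eta = np.zeros(2), 0.0, 0.01
for _ in range(10000):
    x = x - eta * (grad_f(x) + b * grad_g(x))  # gradient descent on x
    b = b + eta * g(x)                         # gradient ascent on b
print(x, b, g(x))  # converges to x = (2.5, -1.5), b = 1, with g(x) = 0
```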

Rakanishu

6

Probably too late to be helpful to the OP but may be useful to others in the same situation:

A problem with absolute-value constraints can often be reformulated into an equivalent problem that only has linear constraints, by adding a few "helper" variables.

For example, consider problem 1:

Find (x1,x2) that minimises f(x1,x2) subject to the nonlinear constraint |x1|+|x2|<=10.

There is a linear-constraint version, problem 2:

Find (x1,x2,x3,x4) that minimises f(x1,x2) subject to the following linear constraints:

  1. x1<=x3
  2. -x1<=x3
  3. x2<=x4
  4. -x2<=x4
  5. x3+x4<=10

Note:

  • If (x1,x2,x3,x4) satisfies the constraints of problem 2, then (x1,x2) satisfies the constraints of problem 1 (because x3 >= |x1| and x4 >= |x2|, so |x1| + |x2| <= x3 + x4 <= 10)
  • If (x1,x2) satisfies the constraints of problem 1, then we can extend it to (x1,x2,x3,x4) satisfying the constraints of problem 2 by setting x3 = |x1|, x4 = |x2|
  • x3, x4 have no effect on the objective function

It follows that finding an optimum for problem 2 will give you an optimum for problem 1, and vice versa.
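
A minimal sketch of problem 2 in SciPy; the quadratic f here is a placeholder whose unconstrained minimum (8, -7) violates |x1| + |x2| <= 10, so the solver has to land on the boundary:

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def f(z):
    # placeholder objective; only x1, x2 enter it, x3, x4 are the helper variables
    x1, x2 = z[0], z[1]
    return (x1 - 8.0) ** 2 + (x2 + 7.0) ** 2

# variables z = (x1, x2, x3, x4); constraints 1-5 above written as A @ z <= b
A = np.array([
    [ 1.0,  0.0, -1.0,  0.0],   #  x1 <= x3
    [-1.0,  0.0, -1.0,  0.0],   # -x1 <= x3
    [ 0.0,  1.0,  0.0, -1.0],   #  x2 <= x4
    [ 0.0, -1.0,  0.0, -1.0],   # -x2 <= x4
    [ 0.0,  0.0,  1.0,  1.0],   #  x3 + x4 <= 10
])
b = np.array([0.0, 0.0, 0.0, 0.0, 10.0])

res = minimize(f, x0=np.zeros(4), method='trust-constr',
               constraints=[LinearConstraint(A, -np.inf, b)])
x1, x2 = res.x[:2]
print(x1, x2, abs(x1) + abs(x2))  # ~ (5.5, -4.5), with |x1| + |x2| = 10
```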