
I am building a script that generates input data [parameters] for another program to calculate, and I would like to optimize the resulting data. Previously I have been using scipy's Powell optimization. The pseudocode looks something like this:

import scipy.optimize

def value(param):
    run_program(param)
    # Parse the program's output here
    return parsed_value

scipy.optimize.fmin_powell(value, param)

This works great; however, it is incredibly slow as each iteration of the program can take days to run. What I would like to do is coarse-grain parallelize this: instead of running a single iteration at a time, it would run (number of parameters)*2 evaluations at a time. For example:

Initial guess: param=[1,2,3,4,5]

#Modify the guess by plus/minus a jump vector that can change at each iteration
jump=[1,1,1,1,1]
#Perturb each variable by +/- jump and launch a run for each perturbation.
for num, a in enumerate(param):
    new_param1 = param[:]
    new_param1[num] += jump[num]
    run_program(new_param1)
    new_param2 = param[:]
    new_param2[num] -= jump[num]
    run_program(new_param2)

#Wait until all programs are complete -> Parse Output
Output=[[value,param],...]
#Create new guess
#Repeat
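The launch step above can be sketched with the standard library's `multiprocessing.Pool`, assuming `run_program` can be made to return the parsed value directly (the quadratic below is a hypothetical stand-in so the sketch is self-contained):

```python
from multiprocessing import Pool

def run_program(p):
    # Hypothetical stand-in for the real external program:
    # returns the parsed objective value directly.
    return sum((x - 3) ** 2 for x in p)

def neighbours(param, jump):
    """The 2 * len(param) perturbed guesses: each variable +/- its jump."""
    out = []
    for i in range(len(param)):
        for sign in (1, -1):
            p = list(param)
            p[i] += sign * jump[i]
            out.append(p)
    return out

if __name__ == "__main__":
    param = [1, 2, 3, 4, 5]
    jump = [1, 1, 1, 1, 1]
    guesses = neighbours(param, jump)
    with Pool() as pool:                    # one worker per core by default
        values = pool.map(run_program, guesses)
    results = sorted(zip(values, guesses))  # [(value, param), ...] best first
```

On a cluster the `Pool` would be replaced by whatever submits jobs to your scheduler; the structure of the iteration stays the same.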

The number of variables can range from 3 to 12, so something like this could potentially speed the code up from taking a year down to a week. All variables are dependent on each other and I am only looking for a local minimum from the initial guess. I have started an implementation using Hessian matrices; however, that is quite involved. Is there anything out there that already does this, is there a simpler way, or any suggestions to get started?

So the primary question is the following: is there an algorithm that takes a starting guess, generates multiple guesses, then uses those multiple guesses to create a new guess, and repeats until a convergence threshold is met? Only analytic derivatives are available. What is a good way of going about this, is there something already built that does this, or are there other options?

Thank you for your time.

As a small update, I do have this working by fitting simple parabolas through the three points in each dimension and then using the minimum as the next guess. This seems to work decently, but is not optimal. I am still looking for additional options.
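For reference, the per-dimension parabola step can be written in closed form. A minimal sketch (the function name is mine): fit a quadratic through (x-h, f_minus), (x, f_center), (x+h, f_plus) and move to its vertex:

```python
def parabola_vertex(x, h, f_minus, f_center, f_plus):
    """Vertex of the parabola through (x-h, f_minus), (x, f_center), (x+h, f_plus)."""
    denom = f_plus - 2.0 * f_center + f_minus   # proportional to the curvature
    if denom <= 0:
        return x  # flat or concave fit: keep the current point
    return x - h * (f_plus - f_minus) / (2.0 * denom)

# Example: samples of (x-2)**2 at x = -1, 0, 1 recover the minimum at x = 2.
print(parabola_vertex(0.0, 1.0, 9.0, 4.0, 1.0))  # -> 2.0
```

The vertex is exact when the objective is locally quadratic along that axis, which is why this works decently on a smooth surface.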

My current best implementation parallelizes the inner loop of Powell's method.

Thank you everyone for your comments. Unfortunately it looks like there is simply no concise answer to this particular problem. If I get around to implementing something that does this I will paste it here; however, as the project is not particularly important nor the need for results pressing, I will likely be content letting it take up a node for a while.

Daniel
  • This is not directly related to your question; but as your task is this resource-intensive, wouldn't it make more sense to use a compiled language like C for performance benefits? – anroesti Dec 07 '12 at 05:25
  • @Ophion You might want to get your code reviewed first. Also consider these performance tips: http://wiki.python.org/moin/PythonSpeed/PerformanceTips – Larry Battle Dec 07 '12 at 05:37
  • The primary code is in C and highly optimized, and unfortunately implementations to parallelize it across multiple compute nodes are not particularly effective. I need something to interface with the primary code and optimize a set of parameters that it was not designed to do. Recoding the primary to do this is an option, but likely more complicated and ultimately to little benefit. – Daniel Dec 07 '12 at 15:20
  • So what exactly do you want to have parallel? What does run_program() do? If it is not messing with any variables, you could easily use a pool and its map function (http://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.map) – Karsten Dec 07 '12 at 15:44
  • run_program(param) executes the primary code with the input parameters and returns a singular value. Essentially what I want to do is have a parallel version of Powell's algorithm or some other minimization algorithm that preferably does not require derivatives and can take multiple simultaneous guesses into account. – Daniel Dec 07 '12 at 15:52
  • so is `speed` the issue, or is `complexity of optimization` the issue? – namit Dec 08 '12 at 16:43
  • It does not necessarily mean I need an optimization that converges faster, but an algorithm that can take multiple guesses. I use the term parallelization to mean that it can run multiple instances of the program together. The reason for this is increasing the number of cores the program can use is fairly inefficient after a certain point, but I have many compute nodes to run many different instances of the program on. So if each node can run its own guess and the minimization algorithm can take all guesses into account, the total cpu time should go up, but real time should go down. – Daniel Dec 08 '12 at 16:51
  • some info on the target function would help: – Lars Dec 11 '12 at 09:33
  • [sry, pressed enter] some info on the target function would help: are there local minima? Is it a smooth (continuous) function? Is it a discrete function? Do you know the boundaries in which the minimum would lie? Can there be multiple minima? Basically any restrictions on the target function could help answer your question. – Lars Dec 11 '12 at 09:40
  • There are many local minima, it is smooth, discrete, there are boundaries and I know where approximately the local and global minima are. It is not a particularly difficult surface, hence the parabolic extrapolation doing a decent job. – Daniel Dec 11 '12 at 17:24

7 Answers


I had the same problem while I was at university: we had a Fortran algorithm to calculate the efficiency of an engine based on a group of variables. At the time we used modeFRONTIER and, if I recall correctly, none of the algorithms were able to generate multiple guesses.

The normal approach would be to have a design of experiments (DOE), and there were some algorithms to generate the DOE to best fit your problem. After that we would run the single DOE entries in parallel, and an algorithm would "watch" the development of the optimizations, showing the current best design.

Side note: if you don't have a cluster and need more computing power, HTCondor may help you.

Mac
  • Hello, I am glad to hear of others with similar problems. Thank you for introducing me to modeFRONTIER, it looks like a very interesting software package. Something that watches various optimizations could be interesting - I will look into it. I fortunately have a cluster that more than meets my computing needs; however, its looking like for this particular project buying a small linux box and shoving it under a desk for a year is looking like the best way of going about this. – Daniel Dec 14 '12 at 20:17

Are derivatives of your goal function available? If yes, you can use gradient descent (old, slow but reliable) or conjugate gradient. If not, you can approximate the derivatives using finite differences and still use these methods. I think in general, if using finite difference approximations to the derivatives, you are much better off using conjugate gradients rather than Newton's method.

A more modern method is SPSA, which is a stochastic method and doesn't require derivatives. For somewhat well-behaved problems, SPSA requires far fewer evaluations of the goal function than finite-difference conjugate gradients for the same rate of convergence.
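For concreteness, here is a minimal SPSA sketch (pure stdlib; the gain sequences `a` and `c` below are illustrative guesses, not tuned values):

```python
import random

def spsa_step(f, theta, a, c):
    """One SPSA update: only two evaluations of f, regardless of dimension."""
    delta = [random.choice((-1.0, 1.0)) for _ in theta]  # Bernoulli +/-1 perturbation
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    g = (f(plus) - f(minus)) / (2.0 * c)                 # shared scalar factor
    return [t - a * g / d for t, d in zip(theta, delta)]

# Illustrative run on a quadratic bowl with minimum at [1, 1, 1].
random.seed(0)
f = lambda p: sum((x - 1.0) ** 2 for x in p)
theta = [5.0, -3.0, 2.0]
for k in range(1, 2001):
    theta = spsa_step(f, theta, a=0.1 / k ** 0.602, c=0.1 / k ** 0.101)
```

The two evaluations per step are independent, so even SPSA admits a factor-of-two parallelization.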

Alex I
  • Analytic derivatives are available, but questionable. Conjugate Gradient methods work really well for this type of work. My current implementation is a Powell's where the inner loop is parallelized. SPSA is really great, but not really what I am looking for here. Really the primary issue comes down to how to parallelize the minimization algorithm. – Daniel Dec 10 '12 at 16:39

There are two ways of estimating gradients, one easily parallelizable, one not:

  • around a single point, e.g. (f(x + h*e_i) - f(x)) / h, where e_i is the unit vector along dimension i; this is easily parallelizable up to Ndim
  • "walking" gradient: walk from x0 in direction e0 to x1, then from x1 in direction e1 to x2, ...; this is sequential.

Minimizers that use gradients are highly developed, powerful, converge quadratically (on smooth enough functions). The user-supplied gradient function can of course be a parallel-gradient-estimator.
A few minimizers use "walking" gradients, among them Powell's method, see Numerical Recipes p. 509.
So I'm confused: how do you parallelize its inner loop?

I'd suggest scipy fmin_tnc with a parallel-gradient-estimator, maybe using central, not one-sided, differences.
(Fwiw, this compares some of the scipy no-derivative optimizers on two 10-d functions; ymmv.)
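A sketch of such a parallel central-difference gradient estimator (stdlib only; `f` is a hypothetical stand-in for the expensive run, and the resulting `parallel_gradient` could be handed to fmin_tnc as its gradient argument):

```python
from multiprocessing import Pool

def f(p):
    # Hypothetical stand-in for the expensive external program.
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

def _shifted(args):
    """Evaluate f with one coordinate shifted by delta."""
    x, i, delta = args
    p = list(x)
    p[i] += delta
    return f(p)

def parallel_gradient(x, h=1e-4, pool=None):
    """Central-difference gradient; all 2*len(x) evaluations are independent."""
    tasks = [(x, i, s) for i in range(len(x)) for s in (+h, -h)]
    vals = list(pool.map(_shifted, tasks)) if pool is not None else list(map(_shifted, tasks))
    return [(vals[2 * i] - vals[2 * i + 1]) / (2.0 * h) for i in range(len(x))]

if __name__ == "__main__":
    with Pool() as pool:
        print(parallel_gradient([0.0, 0.0], pool=pool))  # ~[-2.0, 4.0]
```

Passing `pool=None` falls back to serial evaluation, which makes the estimator easy to test before farming the shifted evaluations out to worker processes or cluster nodes.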

denis
  • This is interesting; I skipped over it when I was looking through the list of implemented functions. It would heavily depend on the gradients produced. I will give it a shot using the parabolic analytic derivatives. Numerical Recipes is a great book! For Powell's method you can find the minimum along a dimension using other methods that can be parallelized. Still looking for something already implemented for point one, if possible. – Daniel Dec 11 '12 at 18:21

I think what you want to do is use the threading capabilities built into Python. Provided your working function has more or less the same run-time whatever the params, it would be efficient.

Create 8 threads in a pool, run 8 instances of your function, get 8 results, run your optimisation algo to change the params with the 8 results, repeat.... profit ?

Félix Cantournet
  • The issue is the optimization algorithm. I will rewrite the main text to make it clear. The primary question is: Is there a minimization algorithm already built that takes multiple inputs and can create multiple guesses to obtain the value of? – Daniel Dec 08 '12 at 16:09
  • can't you make an asynchronous process that runs your optimisation function on one result every time a thread finishes a work item? Because the implementation you want to make will only be better than this if your working function has a very similar computing time whatever the parameters. – Félix Cantournet Dec 08 '12 at 19:32
  • Work function computation time variation is on the order of several percent. Asynchronous process would be a more advanced step, I need to obtain a better minimization algorithm first. – Daniel Dec 08 '12 at 19:46

If I haven't misunderstood what you are asking, you are trying to minimize your function one parameter at a time.

You can do this by creating a set of single-argument functions, where for each function you freeze all the arguments except one.

Then you loop, optimizing each variable and updating the partial solution.

This method can speed up a function of many parameters by a great deal when the energy landscape is not too complex (i.e., the dependency between the parameters is not too strong).

given a function

energy(*args) -> value

you create the guess and the functions:

guess = [1,1,1,1]
funcs = [ lambda x, i=i: energy(*(guess[:i] + [x] + guess[i+1:])) for i in range(len(guess)) ]

then you put them in a while loop for the optimization

while not converged:
    for i, func in enumerate(funcs):
        # minimize func over its single free variable (any 1-D optimizer)
        guess[i] = argmin(func)
    # check for convergence

This is a very simple yet effective method of simplifying your minimization task. I can't really recall what this method is called (it is essentially coordinate descent), but a close look at the Wikipedia entry on minimization should do the trick.
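A runnable version of the idea, using a crude grid scan as the 1-D minimizer so it stays stdlib-only (`energy` and `argmin_1d` are hypothetical stand-ins):

```python
def energy(*args):
    # Hypothetical stand-in with its minimum at (1, 2, 3).
    return (args[0] - 1) ** 2 + (args[1] - 2) ** 2 + (args[2] - 3) ** 2

def argmin_1d(func, lo=-10.0, hi=10.0, steps=2001):
    """Crude 1-D minimizer: scan a grid and return the best x."""
    xs = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(xs, key=func)

guess = [0.0, 0.0, 0.0]
for sweep in range(5):  # a few coordinate-descent sweeps
    for i in range(len(guess)):
        func = lambda x, i=i: energy(*(guess[:i] + [x] + guess[i + 1:]))
        guess[i] = argmin_1d(func)
```

In practice the grid scan would be replaced by a proper 1-D optimizer, and dependent variables simply require more sweeps to converge.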

EnricoGiampieri
  • Yes this would be very easy if all variables were not dependent on each other. – Daniel Dec 10 '12 at 16:21
  • The method is useful because it works even on dependent variables; it just needs more than one iteration for convergence. – EnricoGiampieri Dec 10 '12 at 17:50
  • You are correct in this; however, it is not an improvement over my current implementation. I will state again what I am really looking for is a way to parallelize a minimization algorithm, not piecewise one together. To restate a good example. Powell's algorithm uses two loops, if you parallelize the inner loop you can speed up the outer loop and thus the entire algorithm. – Daniel Dec 10 '12 at 18:11

You could parallelize at two levels: 1) parallelize the calculation of a single iteration, or 2) start N initial guesses in parallel.

For 2) you need a job controller to manage the N initial-guess discovery threads.

Add an extra output to your program, a "lower bound", indicating that descents from the current input parameters won't produce values below this bound.

The N initial-guess threads can then compete with each other: if one thread's lower bound is higher than another thread's current value, that thread can be dropped by your job controller.

Houcheng

Parallelizing local optimizers is intrinsically limited: they start from a single initial point and try to work downhill, so later points depend on the values of previous evaluations. Nevertheless there are some avenues where a modest amount of parallelization can be added.

  • As another answer points out, if you need to evaluate your derivative using a finite-difference method, preferably with an adaptive step size, this may require many function evaluations, but the derivative with respect to each variable may be independent; you could maybe get a speedup by a factor of twice the number of dimensions of your problem. If you've got more processors than you know what to do with, you can use higher-order-accurate gradient formulae that require more (parallel) evaluations.
  • Some algorithms, at certain stages, use finite differences to estimate the Hessian matrix; this requires about half the square of the number of dimensions of your problem, and all of it can be done in parallel.
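The second bullet can be sketched as follows (stdlib only; `f` is a hypothetical stand-in with a known Hessian, and the upper-triangle entries are farmed out as independent jobs):

```python
from itertools import combinations_with_replacement
from multiprocessing import Pool

def f(p):
    # Hypothetical stand-in quadratic with known Hessian [[2, 1], [1, 6]].
    return p[0] ** 2 + p[0] * p[1] + 3.0 * p[1] ** 2

def hessian_entry(args):
    """Central-difference estimate of d2f / (dx_i dx_j)."""
    x, i, j, h = args
    def fp(si, sj):
        p = list(x)
        p[i] += si * h
        p[j] += sj * h
        return f(p)
    return (fp(1, 1) - fp(1, -1) - fp(-1, 1) + fp(-1, -1)) / (4.0 * h * h)

def parallel_hessian(x, h=1e-3, pool=None):
    """Estimate the Hessian; the ~n^2/2 upper-triangle entries are independent."""
    idx = list(combinations_with_replacement(range(len(x)), 2))
    tasks = [(x, i, j, h) for i, j in idx]
    vals = list(pool.map(hessian_entry, tasks)) if pool is not None else list(map(hessian_entry, tasks))
    H = [[0.0] * len(x) for _ in x]
    for (i, j), v in zip(idx, vals):
        H[i][j] = H[j][i] = v
    return H
```

With a `Pool` (or a cluster scheduler) standing in for `pool`, all entries are evaluated concurrently; by symmetry only the upper triangle is computed.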

Some algorithms may also be able to use more parallelism at a modest algorithmic cost. For example, quasi-Newton methods try to build an approximation of the Hessian matrix, often updating this by evaluating a gradient. They then take a step towards the minimum and evaluate a new gradient to update the Hessian. If you've got enough processors so that evaluating a Hessian is as fast as evaluating the function once, you could probably improve these by evaluating the Hessian at every step.

As far as implementations go, I'm afraid you're somewhat out of luck. There are a number of clever and/or well-tested implementations out there, but they're all, as far as I know, single-threaded. Your best bet is to use an algorithm that requires a gradient and compute your own in parallel. It's not that hard to write an adaptive one that runs in parallel and chooses sensible step sizes for its numerical derivatives.

user2475529