
Cross-posted from CS Stack Exchange:


Most versions of simulated annealing I've seen are implemented similarly to the Wikipedia pseudocode below:

Let s = s0
For k = 0 through kmax (exclusive):
  T ← temperature( 1 - (k+1)/kmax )
  Pick a random neighbour, snew ← neighbour(s)
  If P(E(s), E(snew), T) ≥ random(0, 1):
    s ← snew
Output: the final state s
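
To make the discussion concrete, here is a minimal runnable Python translation of that pseudocode. The objective, neighbour function, and linear cooling schedule are illustrative assumptions of mine, not part of the Wikipedia article:

import math
import random

def simulated_annealing(s0, energy, neighbour, kmax, t0=1.0):
    """Vanilla SA per the pseudocode above: note it returns the *final*
    state, not the best state ever visited."""
    s = s0
    for k in range(kmax):
        t = t0 * (1 - (k + 1) / kmax)  # linear cooling, reaching 0 at the end
        s_new = neighbour(s)
        e, e_new = energy(s), energy(s_new)
        # Metropolis acceptance: always take improvements (lower energy);
        # take worse moves with probability exp(-(e_new - e) / t).
        if e_new < e or (t > 0 and random.random() < math.exp(-(e_new - e) / t)):
            s = s_new
    return s

# Toy usage: minimise a 1-D function with many local minima.
f = lambda x: x * x + 10 * math.sin(3 * x)
x = simulated_annealing(
    s0=random.uniform(-10, 10),
    energy=f,
    neighbour=lambda x: x + random.gauss(0, 0.5),
    kmax=10_000,
)
print(x, f(x))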

I am having trouble understanding how this algorithm avoids getting stuck in a local optimum as the temperature cools. If we jump around at the start while the temperature is high, and eventually take only uphill moves as it cools, then isn't the solution found highly dependent on where we just so happened to end up in the search space as the temperature started to cool? We may have found a better solution early on, jumped off of it while the temperature was high, and then be left in a worse position as the temperature cools and we transition to hill climbing.

An often-listed modification to this approach is to keep track of the best solution found so far. I see how this change mitigates the risk of "throwing away" a better solution found in the exploratory stage while the temperature is high, but I don't see how this is any better than simply performing repeated random hill climbing to sample the space, without the temperature theatrics.
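
For what it's worth, that "best so far" modification is only a few extra lines on top of the sketch above (again a sketch, reusing the same assumed helpers and imports):

def simulated_annealing_best(s0, energy, neighbour, kmax, t0=1.0):
    """Same loop as above, but remember the best state ever visited, so
    high-temperature exploration can never throw away a good solution."""
    s = s0
    best, e_best = s0, energy(s0)
    for k in range(kmax):
        t = t0 * (1 - (k + 1) / kmax)
        s_new = neighbour(s)
        e, e_new = energy(s), energy(s_new)
        if e_new < e or (t > 0 and random.random() < math.exp(-(e_new - e) / t)):
            s = s_new
            if e_new < e_best:  # the only change: track the incumbent
                best, e_best = s_new, e_new
    return best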

Another approach that comes to mind is to combine the idea of keeping track of the best-so-far with repeated hill climbing and beam search: for each temperature, perform simulated annealing and track the best n solutions, then start from each of those local peaks at the next temperature (see the sketch below).
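
A sketch of that hybrid might look like the following; everything here is illustrative, and anneal_at_temp stands for a hypothetical helper that runs a short annealing pass from a given state at a fixed temperature:

import heapq

def beam_anneal(seeds, energy, anneal_at_temp, temperatures, n=5):
    """Proposed hybrid: at each temperature, anneal from each surviving
    seed, then keep only the n lowest-energy results as the next beam."""
    beam = list(seeds)
    for t in temperatures:  # assumed to be a decreasing schedule
        candidates = [anneal_at_temp(s, t) for s in beam]
        beam = heapq.nsmallest(n, candidates, key=energy)
    return min(beam, key=energy)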


UPDATE: It was rightly pointed out that I didn't really ask a specific question. I've clarified in the comments but want to reflect it here:

I don't need a transcription of the pseudocode into essay form; I understand how it works: how the temperature balances exploration vs. exploitation, and how this (theoretically) helps avoid local optima.

My specific questions are:

  1. Without a modification to keep track of the globally best solution, wouldn't this algorithm be extremely prone to local optima, despite the temperature component? Yes, I understand that the probability of taking a worse move declines as the temperature cools (i.e. it transitions to pure hill climbing), but it is possible that you found a better solution early on, in the exploration portion (temp hot), jumped off of it, and then as the temperature cools you have no way back because you're on a path that leads to a new local peak.
  2. With the addition of tracking the global optimum, I can absolutely see how this mitigates getting stuck at local peaks, avoiding the problem described above. But how does this improve on simple random search? One specific example is repeated random hill climbing (sketched below). If you're tracking the global optimum and you happen to hit it while in the high-temperature portion, then that essentially is random search.
  3. In what circumstances would this algorithm be preferable to something like repeated random hill climbing, and why? What properties make a problem particularly suited to SA versus an alternative?
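
For reference, the "repeated random hill climbing" baseline in questions 2 and 3 might look like this minimal sketch (all helper names are my assumptions):

import random

def hill_climb(s0, energy, neighbour, steps):
    """Greedy local search: only ever accepts improvements."""
    s, e = s0, energy(s0)
    for _ in range(steps):
        s_new = neighbour(s)
        e_new = energy(s_new)
        if e_new < e:
            s, e = s_new, e_new
    return s, e

def random_restart(random_state, energy, neighbour, restarts=20, steps=500):
    """Restart from fresh random points; keep the best local optimum found."""
    runs = (hill_climb(random_state(), energy, neighbour, steps)
            for _ in range(restarts))
    return min(runs, key=lambda run: run[1])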
Solaxun
  • What is your specific question? – Scott Hunter Jul 11 '22 at 18:18
  • @ScottHunter - I guess I'm wondering a.) how the unmodified version is able to avoid local optima - to me it would seem no different from hill climbing from a random point in the solution space, where that random point is wherever you happened to be when the temp cooled - and b.) in the modified version, how it's any more effective than simply performing random-restart hill climbing. – Solaxun Jul 11 '22 at 18:48
  • As with annealing in metallurgy, there is no guarantee that you'll reach the optimum, but if you have the right material, the right operating conditions, etc., on average you reach a good solution (not the best possible, but something satisfying your needs). – Jean-Baptiste Yunès Jul 12 '22 at 08:59
  • @Solaxun Because, as in "real" annealing, part of your material is already in the good state; increasing the temperature again would not break this, and gives you a chance to move more parts into the good state... – Jean-Baptiste Yunès Jul 12 '22 at 09:01
  • I believe the only guarantees ever proved for simulated annealing are that, with time exponential in the number of steps in the cooling schedule (i.e. very slow cooling), you can guarantee convergence to the global optimum for a certain class of functions under pretty modest assumptions. – ldog Jul 19 '22 at 06:42

3 Answers


By my understanding, simulated annealing is not guaranteed to avoid getting stuck in a local maximum (for maximization problems), especially as it "cools" late in the cycle, as k -> kmax.

So as you say, we "jump all over the place" at the start, but we are still selective in whether or not we accept that jump, because the P() function that determines the probability of acceptance is a function of the goal, E(). Later in the same Wikipedia article, they describe the P() function a bit, and suggest that if the new state is an improvement, then P = 1, so improving moves are always taken, while worsening moves are accepted only with some probability. Later in the cycle, we aren't as willing to take random leaps to lesser results, so the algorithm tends to settle into a maximum (or minimum), which may or may not be the global one.
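
To make that concrete, the usual Metropolis form of P() looks like this in the energy-minimisation convention (a sketch, not the only valid choice):

import math

def acceptance_probability(e, e_new, t):
    """Improvements are always accepted; worse moves survive with a
    probability that shrinks as the gap grows or the temperature falls."""
    if e_new < e:
        return 1.0
    return math.exp(-(e_new - e) / t)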

AirSquid
  • Thanks for the reply. I guess my point is that as SA cools we slowly transition into hill climbing, and when that transition occurs, we may be at a local maximum. Imagine there are 5 maxima in the solution space; it would seem entirely random to me which one we end up at. I know SA is not guaranteed to avoid local maxima (without an absurdly slow cooling schedule), but it seems the unmodified version is really just rolling the dice as to which maximum it ends up at. Modifying it to track the global best improves that, but why not just do repeated random hill climbing at that point? – Solaxun Jul 11 '22 at 19:33
  • As with most metaheuristic methods, the concept behind SA is a gradual transition between exploration and exploitation. Yes, you could just randomly repeat hill climbing, but you would miss out on the gradual transition. Also, mind the No Free Lunch Theorem: there is never going to be a method that is better than any other method at optimizing _all_ objective functions. You can always find an objective function where repeated random hill climbing works better. You can always tune a parameter to suit a bunch of objective functions and lose out with others. Yes, it is random! With a high chance of "good". – ypnos Jul 11 '22 at 19:38
  • Maybe a better way to ask the question - I understand how the algorithm works; what I don't understand is how this approach makes the chances of finding the global optimum any better than random sampling. I get the slow transition, but I'm not sure it really makes a difference. In what circumstances would SA outperform random hill climbing? Is it problem-dependent? What characteristics of a problem make it more suitable for SA vs. RHC? – Solaxun Jul 11 '22 at 19:42
  • @Solaxun: re: "how this approach makes the chances of finding the global optima any better than random sampling" - I think what may be causing you confusion is that this **is** random sampling; specifically, it is the Metropolis-Hastings algorithm. The idea is that optimizing is hard, so SA flips things on their head and says: let's build a sampler such that we converge to the ground state (global optimum). We build this sampler, which remains largely unchanged, and we periodically anneal the underlying distribution so that it becomes ever more peaked around the ground state. – ldog Jul 22 '22 at 02:05
  • The hand-wavy part in all this comes in when I say we make the probability distribution more peaked (by annealing the temperature). There are very real empirical tradeoffs you are making between sampling a more peaked vs. less peaked distribution (which need to be measured and quantified for your problem). But the underlying algorithm is still a simple application of the Metropolis-Hastings sampler: https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Intuition – ldog Jul 22 '22 at 02:08
  • Also please read the last paragraph (or the entire section) here: https://en.wikipedia.org/wiki/Simulated_annealing#Acceptance_probabilities_2 It describes the conditions under which some researchers think deterministic choices are better than random ones. – ldog Jul 22 '22 at 02:14

The challenge in global optimization is efficiency, i.e. finding the optimal solution, or one close to the optimum, with as few function evaluations as possible. So many "we could perform..." ideas are correct in terms of convergence to a good solution, but may fail to be efficient.

Now the main difference between pure hill climbing and annealing is that annealing temporarily admits a worsening of the objective to avoid... getting trapped in a local optimum. Hill climbing only searches the subspace where the function takes larger values than the best so far, which may fail to be connected to the global optimum. Annealing adds some "tunneling" effect.

Temperature decrease is not the key ingredient here; note that hill-climbing methods such as the Nelder-Mead simplex or Hooke-Jeeves pattern search also work with diminishing perturbations. Temperature schemes are more related to a multiscale decomposition of the target function.


The question as asked doesn't seem very coherent: it asks about guarantees offered by the vanilla simulated annealing algorithm, then proceeds to propose multiple modifications with little to no justification.

In any case, I will try to be coherent in this answer.

The only rigorous convergence proof for simulated annealing that I know of is described as follows:

The concept of using an adaptive step-size has been essential to the development of better annealing algorithms. The Classical Simulated Annealing (CSA) was the first annealing algorithm with a rigorous mathematical proof for its global convergence (Geman and Geman, 1984). It was proven to converge if a Gaussian distribution is used for the generating function g_XY(T_k), coupled with an annealing schedule S(T_k) that decreases no faster than T_k = T0 / log(k). The integer k is a counter for the external annealing loop, as will be detailed in the next section. However, a logarithmic decreasing schedule is considered to be too slow, and for many problems the number of iterations required is considered as "overkill" (Ingber, 1993).

Where the proof of this can be found in

Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on pattern analysis and machine intelligence 6 (1984): 721-741.
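
To see why the logarithmic schedule is considered "overkill", note that the smallest k with T0/log(k) <= T is k = exp(T0/T), which blows up very quickly; a quick check (T0 = 10 is an arbitrary illustrative choice):

import math

T0 = 10.0
for target in (1.0, 0.5, 0.1):
    k = math.exp(T0 / target)  # iterations needed to cool down to `target`
    print(f"T <= {target}: needs k ~ {k:.3g} iterations")
# Cooling to T = 0.1 already needs on the order of 2.7e43 iterations.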

In terms of successful modifications of SA: if you peruse the literature you will find many empirical claims of superiority, but few proofs. Having studied this problem in the past, I can say that of the various modified versions of SA I tried empirically for the problems I was interested in at the time, the only modification that consistently gave better results than classical simulated annealing is known as parallel tempering.

Parallel tempering is the modification of SA whereby, instead of mixing one Markov chain (i.e. the annealing step), multiple chains are mixed concurrently at very specific temperatures, with exchanges between chains occurring on a dynamic or static schedule.
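
A compact sketch of the idea, with hypothetical helpers (energy, neighbour) and the standard swap criterion between adjacent temperatures:

import math
import random

def parallel_tempering(states, energy, neighbour, temps, sweeps):
    """One Metropolis chain per temperature; periodically attempt to swap
    states between adjacent temperatures so that good configurations found
    by the hot, exploratory chains percolate down to the cold chains."""
    states = list(states)  # one state per temperature, temps sorted ascending
    for _ in range(sweeps):
        # 1. Ordinary Metropolis update within each chain.
        for i, t in enumerate(temps):
            s_new = neighbour(states[i])
            de = energy(s_new) - energy(states[i])
            if de < 0 or random.random() < math.exp(-de / t):
                states[i] = s_new
        # 2. Swap attempt between neighbouring temperatures,
        #    accepted with probability min(1, exp(d_beta * d_e)).
        for i in range(len(temps) - 1):
            d_beta = 1 / temps[i] - 1 / temps[i + 1]
            d_e = energy(states[i]) - energy(states[i + 1])
            if d_beta * d_e >= 0 or random.random() < math.exp(d_beta * d_e):
                states[i], states[i + 1] = states[i + 1], states[i]
    return min(states, key=energy)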

Original paper on parallel tempering:

Swendsen, Robert H., and Jian-Sheng Wang. "Replica Monte Carlo simulation of spin-glasses." Physical review letters 57.21 (1986): 2607.

There are various free and proprietary implementations of parallel tempering available due to its effectiveness; I would highly recommend starting with a high-quality implementation instead of rolling your own.

Note also that parallel tempering is used extensively throughout the physical sciences for modelling purposes due to its efficacy, but for some reason has very little representation in other areas of the literature (such as computer science), which somewhat baffles me. Perhaps computer scientists' general disinterest in SA modifications stems from a belief that SA is unlikely to be improved upon.

ldog