Simulated Annealing (SA) and Reinforcement Learning (RL) algorithms are meant to solve different classes of problems. The former is meant to find a global optimum, while the latter is meant to find a policy that maximizes an expected reward (not directly a reward nor a state). More precisely, in RL, agents take actions based on a reward and their current state (feedback). The policy of an agent can be seen as a map defining the probability of taking an action given a state, and the value function defines how good it is to be in a state, considering all future actions.
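To make this concrete, here is a minimal sketch (a made-up toy example, not tied to any particular library) of a policy as a map from states to action probabilities and a value function as a map from states to expected returns:

```python
import random

# Toy example: a policy maps each state to a probability distribution over
# actions; a value function maps each state to an estimate of the expected
# return (cumulative future reward) obtained from that state onwards.
policy = {
    "s0": {"left": 0.8, "right": 0.2},  # P(action | state = s0)
    "s1": {"left": 0.1, "right": 0.9},  # P(action | state = s1)
}
value_function = {"s0": 1.5, "s1": 0.3}  # V(s): how good it is to be in s

def sample_action(state):
    """Draw an action for `state` according to the policy's probabilities."""
    actions = list(policy[state])
    weights = list(policy[state].values())
    return random.choices(actions, weights=weights, k=1)[0]

print(sample_action("s0"))  # 'left' with probability 0.8
```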
RL algorithms can be applied to optimize the policy of an agent in a game as long as you can attribute a score to the players. The reward can typically be the score difference between two time steps (i.e. rounds). In many games, chess for example, an opponent can impact the state of the agent, and the agent can only react to it based on a feedback loop. The goal in such a case is to find the sequence of actions that maximizes the chance to win. Naively using SA for such a problem does not make much sense: there is no need to find the best global state. In fact, if we try to apply SA in this case, a good opponent will quickly prevent SA from converging to a good global optimum. SA does not consider the opponent and does not care about the sequence of actions; only the final result matters in SA.
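As an illustration of the "reward = score difference between rounds" idea, here is a toy sketch (the game and the random placeholder policy are invented for illustration; a real RL agent would learn its policy from these per-round rewards):

```python
import random

# Toy game: the agent picks 0 or 1 each round; picking 1 scores a point 70% of the time.
class ToyGame:
    def __init__(self, rounds=10):
        self.rounds = rounds
        self.score = 0
        self.t = 0

    def step(self, action):
        if action == 1 and random.random() < 0.7:
            self.score += 1
        self.t += 1
        return self.score, self.t >= self.rounds  # (current score, done flag)

game = ToyGame()
prev_score, done = 0, False
while not done:
    action = random.choice([0, 1])   # placeholder policy; RL would learn this
    score, done = game.step(action)
    reward = score - prev_score      # reward = score difference between rounds
    prev_score = score
    # an RL algorithm (e.g. Q-learning) would use `reward` here to update its policy
```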
Alternatively, if you want to find the minimum value of a differentiable mathematical function (e.g. high-order polynomials), then RL algorithms are quite useless (and inefficient) because they focus on finding an optimal policy, which you do not need (though an optimal policy can help to find the global optimum, SA is already good for that); you only want the optimal state (and possibly its associated objective value).
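For comparison, a bare-bones SA sketch for minimizing a polynomial could look like this (the objective, cooling schedule, and step size below are arbitrary illustrative choices, not tuned values):

```python
import math
import random

def f(x):
    return x**4 - 3 * x**3 + 2 * x**2 + x  # example high-order polynomial

def simulated_annealing(f, x0=0.0, temp=10.0, cooling=0.995, steps=20_000):
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    for _ in range(steps):
        candidate = x + random.gauss(0.0, 0.5)  # random neighbour of the current state
        f_cand = f(candidate)
        # always accept better moves; accept worse moves with a temperature-dependent probability
        if f_cand < fx or random.random() < math.exp((fx - f_cand) / temp):
            x, fx = candidate, f_cand
            if fx < best_fx:
                best_x, best_fx = x, fx
        temp *= cooling  # cool down
    return best_x, best_fx

print(simulated_annealing(f))  # approximate global minimizer and its objective value
```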
Another key difference is that, AFAIK, E(s) is predefined in SA, while V(s) is generally unknown and must be found by RL algorithms. This is a huge difference, since in practice V(s) depends on the policy, which the RL algorithm also needs to find. If V(s) is known, then the policy can be trivially deduced (the agent needs to perform the action that maximizes V(s)), and if an optimal policy is known, then V(s) can be approximately computed from the Markov chain.
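To illustrate the last point, here is a tiny sketch of deducing a greedy policy from a known V(s) (the state values and the deterministic transition table are made up, and rewards/discounting are omitted to keep it minimal):

```python
# Acting greedily with respect to known state values: pick the action whose
# successor state has the highest value.
V = {"s0": 0.0, "s1": 1.0, "s2": 5.0}  # assumed-known values V(s)
transitions = {                         # transitions[state][action] -> next state
    "s0": {"a": "s1", "b": "s2"},
    "s1": {"a": "s0", "b": "s2"},
    "s2": {"a": "s2", "b": "s2"},
}

def greedy_policy(state):
    """Choose the action leading to the successor state with the highest value."""
    return max(transitions[state], key=lambda a: V[transitions[state][a]])

print({s: greedy_policy(s) for s in V})  # {'s0': 'b', 's1': 'b', 's2': 'a'}
```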