
I am experimenting with the Q-learning algorithm. I have read about it from different sources and understood the algorithm; however, there seems to be no clear, mathematically backed convergence criterion.

Most sources recommend iterating a fixed number of times (e.g. N = 1000), while others say convergence is achieved when every state-action pair (s, a) is visited infinitely often. But how many visits count as "infinitely often"? What is the best criterion for someone who wants to run the algorithm by hand?

I would be grateful if someone could educate me on this. I would also appreciate pointers to any articles on the subject.

Regards.

drtamakloe
    This is off topic for Stack Overflow IMO. – AMC Jan 13 '20 at 02:14
  • @drtamakloe If one of the answers below has solved your question, please consider [accepting it](https://meta.stackexchange.com/q/5234/179419) by clicking the check mark next to it. This indicates to the wider community that you've found a solution. – Brett Daley Feb 03 '20 at 03:30

2 Answers


Q-Learning was a major breakthrough in reinforcement learning precisely because it was the first algorithm with guaranteed convergence to the optimal policy. It was originally proposed in (Watkins, 1989) and its convergence proof was refined in (Watkins & Dayan, 1992).

In short, two conditions must be met to guarantee convergence in the limit, meaning that the policy will become arbitrarily close to the optimal policy after an arbitrarily long period of time. Note that these conditions say nothing about how fast the policy will approach the optimal policy.

  1. The learning rates must approach zero, but not too quickly. Formally, the sum of the learning rates must diverge while the sum of their squares converges: Σ αₜ = ∞ and Σ αₜ² < ∞. An example sequence with these properties is 1/1, 1/2, 1/3, 1/4, ...
  2. Each state-action pair must be visited infinitely often. This has a precise mathematical definition: every action must have a non-zero probability of being selected by the policy in every state, i.e. π(s, a) > 0 for all (s, a). In practice, using an ε-greedy policy (with ε > 0) satisfies this condition. A minimal sketch illustrating both conditions follows this list.
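
For concreteness, here is a sketch of tabular Q-learning on a toy 4-state chain MDP. It is only an illustration of the two conditions above, not part of the original convergence proof; the environment, ε = 0.2, γ = 0.9, and the episode count are all assumptions.

```python
import numpy as np

# Toy 4-state chain: action 1 moves right, action 0 moves left,
# reward 1 on reaching the last state. (Entirely an assumed example.)
rng = np.random.default_rng(0)
n_states, n_actions, gamma, epsilon = 4, 2, 0.9, 0.2

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))   # N(s, a), used for the learning rate

for episode in range(3000):
    s, done = 0, False
    while not done:
        # Condition 2: ε-greedy keeps every action's selection probability > 0.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)

        # Condition 1: learning rate 1/N(s, a) decays like 1, 1/2, 1/3, ...,
        # so the sum of the rates diverges while the sum of squares converges.
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]

        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(np.round(Q, 3))   # the greedy action (argmax per row) should point right
```

Keep in mind that satisfying both conditions only guarantees convergence in the limit; the sketch converges quickly here only because the toy problem is tiny.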
Brett Daley

In practice, an RL algorithm is often judged to have converged when its learning curve (e.g. average return per episode, or the size of the value updates) flattens out and no longer improves. However, what to monitor depends on your algorithm's and your problem's specifications; a small sketch of such a check is given below.
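
As a rough illustration of that kind of stopping check (my own sketch, not something prescribed by Q-learning itself; the window size and tolerance are arbitrary assumptions), one can track how much the Q-table changes per episode and stop once the change stays small:

```python
import numpy as np

def has_converged(q_deltas, window=100, tol=1e-4):
    """q_deltas: per-episode values of max |Q_new - Q_old| (a list of floats)."""
    # Converged when the largest recent update stays below the tolerance
    # for an entire window of episodes.
    return len(q_deltas) >= window and max(q_deltas[-window:]) < tol

# Inside a training loop, one might record after each episode:
#     q_deltas.append(float(np.max(np.abs(Q - Q_prev))))
# and stop training when has_converged(q_deltas) returns True.
```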

In theory, Q-learning has been proven to converge towards the optimal solution, but it is usually not obvious how to tune the hyperparameters (such as the learning rate) in a way that ensures convergence.

Keep in mind that Q-learning is an old and somewhat dated algorithm; it is a good way to learn about RL, but there are better methods for solving real-life problems.

Alaleh