I am using the rlglue-based python-rl framework for Q-learning. My understanding is that, over a number of episodes, the algorithm converges to an optimal policy (a mapping that says which action to take in which state).
Question 1: Does this mean that after a number of episodes (say 1000 or more) I should essentially end up with the same state:action mapping?
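To make Question 1 concrete, this is roughly what I mean by the state:action mapping. It is only a minimal tabular sketch, not the actual python-rl / RL-Glue API; the constants and helper names are hypothetical:

```python
import random
from collections import defaultdict

# Hypothetical tabular Q-learning sketch (not the python-rl / RL-Glue API).
# After training, the greedy policy is the state -> action mapping I refer to.
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
ACTIONS = [0, 1, 2, 3]

Q = defaultdict(float)  # keyed by (state, action)

def choose_action(state):
    """Epsilon-greedy action selection during learning."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One Q-learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def greedy_policy(states):
    """Extract the state:action mapping from the learned Q-table."""
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in states}
```

My question is whether this `greedy_policy` mapping should be essentially identical once learning has converged.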
When I plot the rewards (or the rewards averaged over 100 episodes), I get a graph similar to Fig. 6.13 in this link.
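For reference, this is roughly how I compute the averaged curve (a sketch with placeholder data; in my runs `episode_rewards` is the list of per-episode returns collected during training):

```python
import numpy as np
import matplotlib.pyplot as plt

# episode_rewards: one total reward per episode, collected during training
# (filled with placeholder values here just to make the sketch runnable).
episode_rewards = np.random.randn(1000).cumsum()

window = 100
# Moving average of the per-episode rewards over a 100-episode window.
averaged = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")

plt.plot(episode_rewards, alpha=0.3, label="per-episode reward")
plt.plot(np.arange(window - 1, len(episode_rewards)), averaged, label="100-episode average")
plt.xlabel("episode")
plt.ylabel("reward")
plt.legend()
plt.show()
```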
Question 2: If the algorithm has converged to some policy, why do the rewards drop? Is it possible for the rewards to vary drastically?
Question 3: Is there a standard method I can use to compare the results of various RL algorithms?