
I am using the RL-Glue based python-rl framework for Q-learning. My understanding is that, over a number of episodes, the algorithm converges to an optimal policy (a mapping that says which action to take in each state).

Question 1: Does this mean that after a number of episodes (say 1000 or more) I should essentially get the same state-to-action mapping?

When I plot the rewards (or the rewards averaged over 100 episodes), I get a graph similar to Fig. 6.13 in this link.
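For reference, this is roughly how I compute the averaged curve (a minimal sketch; episode_rewards is just the list of per-episode returns my experiment loop collects, and the window size of 100 is my own choice):

    # Minimal sketch: smooth per-episode rewards with a moving average
    # before plotting. `episode_rewards` is whatever list of returns the
    # experiment loop collects (one entry per episode).
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_smoothed_rewards(episode_rewards, window=100):
        rewards = np.asarray(episode_rewards, dtype=float)
        smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
        plt.plot(smoothed)
        plt.xlabel("Episode")
        plt.ylabel("Mean reward over last %d episodes" % window)
        plt.show()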

Question 2: If the algorithm has converged to some policy, why do the rewards drop? Can the rewards vary drastically?

Question 3: Is there a standard method I can use to compare the results of different RL algorithms?

okkhoy
  • The Fig 6.13 link is dead. Could you please embed the image in the question (if you can still retrieve it somewhere)? That would improve readability. – Tropilio Apr 06 '20 at 09:57

1 Answer


Q1: It will converge to a single mapping, unless more than one mapping is optimal.
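For example, with a tabular Q function you can read off the greedy mapping and compare it across episodes (a minimal sketch; storing `q` as a dict keyed by (state, action) is just one possible representation):

    # Minimal sketch: read the greedy state -> action mapping out of a
    # tabular Q function. `q` is assumed to be a dict keyed by (state, action).
    def greedy_policy(q, states, actions):
        return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}

If two actions are tied for the highest value in some state, either choice is optimal; that is the situation where more than one mapping can be optimal.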

Q2: Q-learning has an exploration parameter that determines how often the agent takes random, potentially sub-optimal actions. The rewards will keep fluctuating as long as this parameter is non-zero.
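For instance, with the common epsilon-greedy scheme the agent picks a random action with probability epsilon (a minimal sketch; the exact parameter name and selection rule in python-rl may differ):

    import random

    # Minimal sketch of epsilon-greedy action selection: explore with
    # probability epsilon, otherwise take the greedy action.
    def epsilon_greedy(q, state, actions, epsilon):
        if random.random() < epsilon:
            return random.choice(actions)                     # explore
        return max(actions, key=lambda a: q[(state, a)])      # exploit

Decaying epsilon towards zero over time, or evaluating the learned policy separately with epsilon = 0, removes these exploratory dips from the reward curve.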

Q3: Reward graphs, as in the link you provided. Check http://rl-community.org.
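A common way to make such graphs comparable is to average the learning curves over many independent runs of each algorithm (a minimal sketch; `run_experiment` is a hypothetical function that runs one trial and returns a list of per-episode rewards):

    import numpy as np

    # Minimal sketch: average learning curves over several independent runs
    # so different algorithms can be plotted on the same axes.
    # `run_experiment` is a hypothetical callable: n_episodes -> list of rewards.
    def mean_learning_curve(run_experiment, n_runs=30, n_episodes=1000):
        curves = np.array([run_experiment(n_episodes) for _ in range(n_runs)])
        return curves.mean(axis=0), curves.std(axis=0)

Plotting the per-episode mean (with, say, a standard-deviation band) for each algorithm gives the kind of learning-curve comparison commonly used in the RL literature.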

Don Reba
  • For Q2 and Q3, thanks for clarifying (and for the link). I have a follow-up question on Q1: how do I determine which mapping is optimal? If the rewards vary, can I average over N runs for each such mapping and compare? (Sorry if the question is too naive; I am still learning.) – okkhoy Apr 15 '14 at 09:31
  • You could wait until the mapping is stable for some number of steps or take a look at the reward graph and see where it levels off, save for exploratory reward fluctuations. – Don Reba Apr 16 '14 at 02:18
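A minimal sketch of the stability check suggested in the last comment (the `greedy_policy` helper sketched in the answer above is assumed, and the `patience` value is an arbitrary choice):

    # Minimal sketch: declare convergence once the greedy mapping has not
    # changed for `patience` consecutive episodes. `policy_history` is a
    # list of greedy mappings recorded after each episode (hypothetical).
    def policy_is_stable(policy_history, patience=50):
        if len(policy_history) < patience:
            return False
        last = policy_history[-1]
        return all(p == last for p in policy_history[-patience:])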