I am using the rlglue-based python-rl framework for Q-learning. My understanding is that, over a number of episodes, the algorithm converges to an optimal policy (a mapping that says which action to take in which state).
Question 1: Does this mean that after a number of episodes (say 1000 or more) I should essentially end up with the same state:action mapping?
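To make Question 1 concrete, this is roughly what I mean by the state:action mapping. It is only a minimal tabular sketch, not the actual python-rl / RL-Glue API; the constants and helper names are hypothetical:

```python
import random
from collections import defaultdict

# Hypothetical tabular Q-learning sketch (not the python-rl / RL-Glue API).
# After training, the greedy policy is the state -> action mapping I refer to.
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
ACTIONS = [0, 1, 2, 3]

Q = defaultdict(float)  # keyed by (state, action)

def choose_action(state):
    """Epsilon-greedy action selection during learning."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """One Q-learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def greedy_policy(states):
    """Extract the state:action mapping from the learned Q-table."""
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in states}
```

My question is whether this `greedy_policy` mapping should be essentially identical once learning has converged.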
When I plot the rewards (or the rewards averaged over 100 episodes), I get a graph similar to Fig. 6.13 in this link.
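For reference, this is roughly how I compute the averaged curve (a sketch with placeholder data; in my runs `episode_rewards` is the list of per-episode returns collected during training):

```python
import numpy as np
import matplotlib.pyplot as plt

# episode_rewards: one total reward per episode, collected during training
# (filled with placeholder values here just to make the sketch runnable).
episode_rewards = np.random.randn(1000).cumsum()

window = 100
# Moving average of the per-episode rewards over a 100-episode window.
averaged = np.convolve(episode_rewards, np.ones(window) / window, mode="valid")

plt.plot(episode_rewards, alpha=0.3, label="per-episode reward")
plt.plot(np.arange(window - 1, len(episode_rewards)), averaged, label="100-episode average")
plt.xlabel("episode")
plt.ylabel("reward")
plt.legend()
plt.show()
```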
Question 2: If the algorithm has converged to some policy, why do the rewards drop? Is it possible for the rewards to vary drastically?
Question 3: Is there a standard method I can use to compare the results of various RL algorithms?