
The learner might be in a training stage, where it updates the Q-table over a number of epochs.

In this stage, the Q-table is updated using gamma (the discount rate) and alpha (the learning rate), and actions are chosen according to a random action rate.

After some epochs, when the reward becomes stable, let me call this "training is done". Do I then have to ignore these parameters (gamma, learning rate, etc.)?

I mean, in the training stage, I got an action from the Q-table like this:

if rand_float < rar:
    # explore: with probability rar, pick a random action
    action = rand.randint(0, num_actions - 1)
else:
    # exploit: pick the best known action from the Q-table
    action = np.argmax(Q[s_prime_as_index])

But after the training stage, do I have to remove rar, which means I would get an action from the Q-table like this?

action = np.argmax(Q[s_prime_as_index])  # always exploit the learned values
user3595632

2 Answers


Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only affect updates.
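
For reference, here is a minimal sketch of a tabular Q-learning update (the names q_update, Q, s, a, r, s_prime, alpha, and gamma are illustrative, chosen to match the question's snippet); note that alpha and gamma appear only inside this update:

import numpy as np

def q_update(Q, s, a, r, s_prime, alpha=0.2, gamma=0.9):
    # alpha and gamma only matter here; once you stop calling the update,
    # they stop mattering as well
    td_target = r + gamma * np.max(Q[s_prime])   # bootstrapped one-step target
    Q[s, a] += alpha * (td_target - Q[s, a])     # move the estimate toward the target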

The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged, however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
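
As a rough illustration (select_action and its arguments are placeholder names mirroring the question's code), a single routine can cover both phases; passing epsilon = 0 turns it into plain argmax:

import random
import numpy as np

def select_action(Q, s, num_actions, epsilon):
    # epsilon > 0: e-greedy (explore with probability epsilon)
    # epsilon == 0: pure exploitation, i.e. greedy argmax
    if random.random() < epsilon:
        return random.randint(0, num_actions - 1)
    return int(np.argmax(Q[s]))

# during training:   select_action(Q, s, num_actions, epsilon=rar)
# after convergence: select_action(Q, s, num_actions, epsilon=0.0)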

Nick Walker
  • One more thing: in the training stage, do I have to re-initialize `rar` at the start of every new epoch? When I re-initialize `rar` every epoch, it does not converge! ... I think it would take too much time...? – user3595632 Apr 27 '17 at 06:56
  • As Pablo's answer notes, "you should decrease the epsilon parameters (rar in your case) with the number of episodes (or steps)." So don't reinitialize it every episode, just allow it to continue decaying. – Nick Walker May 04 '17 at 16:13

Although the answer provided by @Nick Walker is correct, here is some additional information.

What you are talking about is closely related to the concept known as the "exploration-exploitation trade-off". From the Sutton & Barto book:

The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best.

One way to implement the exploration-exploitation trade-off is epsilon-greedy exploration, which is what you are using in your code sample. So, in the end, once the agent has converged to the optimal policy, it should only select actions that exploit the current knowledge, i.e., you can drop the rand_float < rar part. Ideally, you should decrease the epsilon parameter (rar in your case) with the number of episodes (or steps).
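
For example, a simple per-episode decay of rar could look like this (the decay factor 0.99, the floor 0.01, and the episode count are arbitrary illustrative values):

num_episodes = 500
rar = 1.0          # start fully exploratory
rar_min = 0.01     # keep a little exploration while still training

for episode in range(num_episodes):
    # ... run one episode, choosing actions e-greedily with the current rar ...
    rar = max(rar_min, rar * 0.99)   # decay after each episode

# once training is considered done, act greedily (rar = 0)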

On the other hand, regarding the learning rate, it is worth noting that, theoretically, this parameter should satisfy the Robbins-Monro conditions:

∑ₜ αₜ = ∞   and   ∑ₜ αₜ² < ∞

These conditions imply that the learning rate should decay towards zero asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
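
For instance, a schedule such as alpha_t = 1 / (t + 1) satisfies these conditions (the harmonic series diverges while the sum of its squares converges); a small sketch, where t counts the number of updates so far:

def alpha_schedule(t):
    # sum_t 1/(t+1) diverges while sum_t 1/(t+1)^2 converges,
    # so the Robbins-Monro conditions hold
    return 1.0 / (t + 1)

# e.g. use alpha_schedule(t) in place of a fixed alpha in the Q-update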

In practice, you can sometimes simply keep epsilon and alpha fixed until your algorithm converges and then set them to 0 (i.e., ignore them).

Pablo EM