I've been learning Q-learning from the YouTube lecture below: https://www.youtube.com/watch?v=Gq1Azv_B4-4&list=PLlMOxjd7OfgNxJSgF8pAs3_qMion-X1QI&index=2

In this tutorial, the instructor uses the epsilon-greedy methodology like this (I cut the details out):

import gym
import numpy as np

env = gym.make("MountainCar-v0")
# (q_table and discrete_state are built in the parts I cut out)

EPISODES = 2000
epsilon = 0.5
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)  # this part is very confusing to me

for episode in range(EPISODES):
    done = False
    while not done:

        if np.random.random() > epsilon:
            # exploit: take the best action the Q-table currently knows
            action = np.argmax(q_table[discrete_state])
        else:
            # explore: take a random action
            action = np.random.randint(0, env.action_space.n)

        # ... env.step(action), the Q-table update, and the done check are cut out ...

    # decay epsilon once per episode while inside the decaying window
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

I can somewhat understand the concept of epsilon-greedy, but I don't have the faintest idea how to apply it when programming. What I understood is that epsilon-greedy balances exploration against exploitation. But I don't know why epsilon should be diminished, or what decides the epsilon decay value formula.

Baaam Park

1 Answer

Epsilon is diminished because, as your model explores and learns, it becomes less and less important to explore and more and more important to follow your learned policy. Imagine this scenario: if your model still "explores" after learning a policy, it may very well choose an action it knows to be a poor choice. The whole point of using epsilon-greedy is that it helps the learning process, not the decision-making process.
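
To illustrate that last point, here is a tiny sketch of the difference (the Q-values are hypothetical, and the state lookup from the question's code is replaced by a single row for brevity): during training you sometimes act randomly on purpose, but once training is finished you would just follow the table.

import numpy as np

rng = np.random.default_rng()
n_actions = 3                       # MountainCar-v0 has 3 actions
q_row = np.array([0.1, -0.4, 0.7])  # hypothetical Q-values for one state

# during training (learning): epsilon-greedy, sometimes deliberately random
epsilon = 0.2
if rng.random() > epsilon:
    action = int(np.argmax(q_row))          # exploit
else:
    action = int(rng.integers(n_actions))   # explore

# after training (decision-making): just follow the learned policy
greedy_action = int(np.argmax(q_row))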

Epsilon decay typically follows an exponential decay function, meaning epsilon is multiplied by some factor after every x episodes. I believe sentdex actually provides one later in his video/s. The key factor in determining your epsilon decay function is typically the scale at which it decays (in the exponential case, by what percentage does it decay, and after how many episodes?). There's also the question of whether your environment would benefit from flooring the function, i.e. never letting epsilon drop below some minimum.
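
To make that concrete, here is a minimal sketch comparing the linear schedule from your question with an exponential schedule that has a floor. The decay factor, decay interval, and minimum epsilon are arbitrary values picked for illustration, not numbers from the video.

EPISODES = 2000

# Linear schedule from the question: subtract a fixed amount each episode so
# epsilon falls from 0.5 to roughly 0 by END_EPSILON_DECAYING (episode 1000).
epsilon = 0.5
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

# Exponential schedule with a floor: multiply by a factor every few episodes
# and never go below a minimum, so some exploration always remains.
eps_exp = 0.5
DECAY_FACTOR = 0.95   # arbitrary: keep 95% of epsilon at each decay step
DECAY_EVERY = 20      # arbitrary: decay once every 20 episodes
EPS_MIN = 0.01        # arbitrary floor

for episode in range(EPISODES):
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value
    if episode % DECAY_EVERY == 0:
        eps_exp = max(EPS_MIN, eps_exp * DECAY_FACTOR)
    if episode % 500 == 0:
        print(f"episode {episode}: linear eps = {max(epsilon, 0.0):.3f}, "
              f"exponential eps = {eps_exp:.3f}")

The linear version reaches zero halfway through training and then stops exploring entirely, while the exponential version with a floor keeps a small amount of exploration for the rest of training.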

M Z
  • But I'm uncertain why epsilon should be diminished. My guess is that if epsilon keeps going down, the agent becomes more likely to exploit, because by the very end the agent will not explore at all, since it has updated the Q-table enough to make the optimal choice at every state. This is just my guess and I hope you can tell me whether I'm right or not. Am I getting somewhere? – Baaam Park Aug 02 '20 at 14:45
  • Yes, that is correct. It is also the case that too much exploration leads to an agent that never reaches the end-goal state and is therefore not learning well. However, there is also research on resetting epsilon in late-game stages, since otherwise there may be a lack of exploration in that stage of the environment. – M Z Aug 02 '20 at 15:43
  • Thank you so much! You helped me a lot. I feel indebted bro! – Baaam Park Aug 02 '20 at 16:06