I've been learning Q-learning from the YouTube lecture below: https://www.youtube.com/watch?v=Gq1Azv_B4-4&list=PLlMOxjd7OfgNxJSgF8pAs3_qMion-X1QI&index=2
In this tutorial, the instructor uses an epsilon-greedy strategy like this (I cut the details out):
import gym
import numpy as np
env = gym.make("MountainCar-v0")
EPISODES = 2000
epsilon = 0.5
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)  # this part is very confusing to me
for episode in range(EPISODES):
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.random() > epsilon:
            action = np.argmax(q_table[discrete_state])  # exploit: best known action
        else:
            action = np.random.randint(0, env.action_space.n)  # explore: random action
        # (env.step, Q-table update, discrete_state, etc. cut out)
    # decay epsilon once per episode while inside the decay window
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value
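To try to make sense of the decay formula, I plugged the tutorial's constants into a small standalone script of my own (the print loop is just my addition, not from the lecture):

import math

EPISODES = 2000
epsilon = 0.5
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2  # 1000

# 0.5 spread evenly over (1000 - 1) = 999 episodes, about 0.0005 per episode
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

for episode in range(EPISODES):
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value
    if episode % 250 == 0:
        print(episode, round(epsilon, 4))
# epsilon falls linearly from 0.5 to ~0 by episode 1000, then stops changing
# (it actually dips a hair below zero, which seems harmless here since
#  np.random.random() > epsilon is then always true, i.e. pure exploitation)

So if I read it right, the formula just divides the starting epsilon by the number of episodes in the decay window, so that subtracting it once per episode brings epsilon to zero exactly when the window ends.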
I somewhat understand the concept of epsilon-greedy, but I haven't the faintest idea how to apply it when programming. What I understand is that epsilon-greedy is meant to balance exploration and exploitation. But I don't know why epsilon should be diminished, or what determines the formula for the epsilon decay value.
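For reference, here is how I currently picture epsilon-greedy as a standalone function (this is my own sketch, not code from the tutorial); is this the right idea?

import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random.default_rng()):
    """My understanding: with probability epsilon pick a random action
    (explore), otherwise pick the highest-Q action (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

# with epsilon = 0.5 roughly half the picks are random; as epsilon
# decays toward 0, the agent almost always takes the greedy action
print(epsilon_greedy_action(np.array([0.1, 0.9, 0.3]), epsilon=0.5))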