3

I am trying to understand the epsilon-greedy method in DQN. I am learning from the code available at https://github.com/karpathy/convnetjs/blob/master/build/deepqlearn.js

Following is the update rule for epsilon, which changes with age:

this.epsilon = Math.min(1.0, Math.max(this.epsilon_min, 1.0-(this.age - this.learning_steps_burnin)/(this.learning_steps_total - this.learning_steps_burnin)));

Does this mean the epsilon value starts at the minimum (chosen by the user), increases with age up to the burn-in steps, and eventually reaches 1? Or does epsilon start around 1 and then decay to epsilon_min?

Either way, exploration almost stops after this process. So, do we need to choose learning_steps_burnin and learning_steps_total carefully? Any thoughts on what values should be chosen?

SKG
  • 31
  • 1
  • 4

2 Answers

5

Since epsilon denotes the amount of randomness in your policy (the action is greedy with probability 1-epsilon and random with probability epsilon), you want to start with a fairly randomized policy and later slowly move towards a deterministic policy. Therefore, you usually start with a large epsilon (like 0.9, or 1.0 in your code) and decay it to a small value (like 0.1). The most common and simple approaches are linear decay and exponential decay. Usually, you have an idea of how many learning steps you will perform (what your code calls learning_steps_total) and tune the decay parameters (in your case learning_steps_burnin) such that over this interval epsilon goes from 0.9 to 0.1.
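To make the action-selection rule above concrete, here is a minimal Python sketch of epsilon-greedy selection; the Q-values and the number of actions are hypothetical placeholders, not part of the deepqlearn.js code:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    # With probability epsilon pick a uniformly random action (exploration),
    # otherwise pick the action with the highest Q-value (exploitation).
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the choice is always greedy (index of the largest Q-value).
print(epsilon_greedy_action([0.1, 0.5, 0.2], epsilon=0.0))  # -> 1
```

With epsilon = 1 every action is equally likely, which is why training typically starts near 1 and decays.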

Your code is an example of linear decay. An example of exponential decay is

epsilon = 0.9
decay = 0.9999
min_epsilon = 0.1
for i in range(n):  # n = total number of learning steps
    epsilon = max(min_epsilon, epsilon * decay)
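And to answer the question directly: the linear rule in your code keeps epsilon at 1.0 during the burn-in phase and then decays it down to epsilon_min. A Python sketch of that exact update (the parameter values here are assumptions for illustration):

```python
epsilon_min = 0.1
burnin = 100    # learning_steps_burnin (assumed value)
total = 1000    # learning_steps_total (assumed value)

def epsilon_at(age):
    # The linear schedule from the question, clipped to [epsilon_min, 1.0].
    return min(1.0, max(epsilon_min, 1.0 - (age - burnin) / (total - burnin)))

print(epsilon_at(0))     # 1.0 -> fully random during burn-in
print(epsilon_at(550))   # 0.5 -> halfway through the linear decay
print(epsilon_at(5000))  # 0.1 -> clipped at epsilon_min after total steps
```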
Simon
  • 5,070
  • 5
  • 33
  • 59
  • are there any difference in performing linear or exponential decay over the same number of time steps? – AleB May 08 '20 at 23:30
  • @AleB There is no rule of thumb, it really depends on the algorithm and on the environment. It's all hyperparameters optimization. – Simon May 14 '20 at 11:09
  • I think this is a power law decay, actually, not an exponential one. – Tropilio Jul 03 '20 at 16:31
0

Personally, I recommend an epsilon decay schedule such that you reach the minimum value of epsilon (I suggest something between 0.05 and 0.0025) after about 50-75% of the training, after which only the improvement of the policy itself remains. I created a small script to tune the various parameters; it reports after what fraction of training the decay stops (at the indicated minimum value).

import matplotlib.pyplot as plt
import numpy as np

eps_start = 1.0
eps_min = 0.05
eps_decay = 0.9994
epochs = 10000
df = np.zeros(epochs)
stop = epochs  # default: decay never reaches eps_min within training
for i in range(epochs):
    if i == 0:
        df[i] = eps_start
    else:
        df[i] = df[i-1] * eps_decay
        if df[i] <= eps_min:
            stop = i
            break

print("With these parameters you will stop epsilon decay after {:.1f}% of training".format(stop / epochs * 100))
plt.plot(df[:stop + 1])
plt.show()