
Is there a standard practice or a tool in Keras/keras-rl that will give an estimate of the episode rewards that is decorrelated from epsilon during training?

In training the following DQN network, I can measure the episode rewards over time during training. However, due to the nature of the problem, as epsilon decreases the episode rewards will increase regardless of whether or not the model has improved from training. Because of this, it is difficult to tell whether the model is improving/converging, or whether the increasing episode rewards are just due to the linear annealing of epsilon.

If I had to work around this manually, I would train for a fraction of the total desired training steps, then test the model with epsilon = 0, record the average episode reward at that point, manually change epsilon, and then repeat the cycle (sketched at the end of this post). This seems like a hack, though, and I would think that anyone else using linear annealing of epsilon would run into the same issue.

Thoughts?

My model is constructed as follows:

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory

# Simple MLP on a flattened window of observations
model = Sequential()
model.add(Flatten(input_shape=(WINDOW_LENGTH,) + (observation_space_count,)))
for i in range(hidden_layer_count):
    model.add(Dense(observation_space_count*layer_width))
    model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))

memory = SequentialMemory(limit=memory_length, window_length=WINDOW_LENGTH)
# Epsilon is annealed linearly from 0.75 down to 0.01 over all training steps
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=0.75, value_min=.01, value_test=.0, nb_steps=TOTAL_STEPS)

dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=memory_length, target_model_update=1e-2, policy=policy, gamma=.99)
dqn.compile(Adam(lr=LEARNING_RATE), metrics=['mae'])

A typical training graph may look like this: [typical training metrics]
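For reference, here is a rough sketch of the manual workaround I described above, alternating short training phases with greedy evaluation phases so the logged rewards are decorrelated from the exploration schedule. This is only a sketch: env is the Gym environment the agent runs on, and N_PHASES, STEPS_PER_PHASE, eps_schedule, and greedy_rewards are names I made up for illustration. Also, whether repeated calls to dqn.fit() reset the agent's step counter (and hence the annealing) may depend on the keras-rl version.

import numpy as np

N_PHASES = 10
STEPS_PER_PHASE = TOTAL_STEPS // N_PHASES
eps_schedule = np.linspace(0.75, 0.01, N_PHASES + 1)

greedy_rewards = []
for phase in range(N_PHASES):
    # "Manually change epsilon": restrict the annealing range to this phase
    policy.value_max = eps_schedule[phase]
    policy.value_min = eps_schedule[phase + 1]
    policy.nb_steps = STEPS_PER_PHASE

    # Train for a fraction of the total desired steps
    dqn.fit(env, nb_steps=STEPS_PER_PHASE, verbose=0)

    # dqn.test() acts through the agent's test_policy (GreedyQPolicy by
    # default, i.e. epsilon = 0), so these rewards should reflect only the
    # learned Q-function, not the exploration rate
    history = dqn.test(env, nb_episodes=20, visualize=False, verbose=0)
    greedy_rewards.append(np.mean(history.history['episode_reward']))

print(greedy_rewards)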
