
I tried to use the following paper (Prioritized Experience Replay) to improve the learning of my agent: https://arxiv.org/pdf/1511.05952.pdf

While it seems to work very well in deterministic environments, I feel like it would actually make things worse in a stochastic one.

Let's assume that for action A_w at state S_w we get a 50% chance of a reward of +1000000 and a 50% chance of a reward of -1000000 (and negligible deterministic rewards in other states). The true Q value for that action would therefore be 0.

When training on either one of the possible samples (assuming both outcomes are in the replay memory), the priority of these samples will be set to roughly 1000000, so the probability of picking them for the upcoming updates will tend to 1 (each of them oscillating between 0 and 1) if we don't add new samples to the replay memory.

The other samples will therefore never be trained on.
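For concreteness, here is a minimal sketch of the dynamic I mean. The tabular Q update, the learning rate, the 8 "ordinary" transitions with tiny rewards, and the simplified proportional prioritisation are my own illustration, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Replay memory: the two extreme outcomes of (S_w, A_w) plus 8 ordinary transitions.
rewards = np.array([1_000_000.0, -1_000_000.0] + [0.01] * 8)
priorities = np.ones_like(rewards)   # simplification: everything starts with equal priority
q_value = 0.0                        # tabular estimate of Q(S_w, A_w); the true value is 0
lr = 0.1

for step in range(1000):
    probs = priorities / priorities.sum()    # proportional prioritisation, alpha = 1
    i = rng.choice(len(rewards), p=probs)
    if i < 2:                                # one of the two (S_w, A_w) samples
        td_error = rewards[i] - q_value
        q_value += lr * td_error
    else:                                    # ordinary transition with a tiny TD error
        td_error = rewards[i]
    priorities[i] = abs(td_error)            # priority tracks the latest |TD error|

print(priorities / priorities.sum())
# Most of the probability mass stays on the two extreme samples,
# so the ordinary transitions are hardly ever trained on.
```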

My question is: how do we deal with that? Should I simply avoid using this technique in such an environment?

user3548298

1 Answer


The authors of the paper seem to address this issue in a couple of places. Most importantly, they mention reward clipping:

Rewards and TD-errors are clipped to fall within [−1, 1] for stability reasons.

This means that if the reward is 1000000, then they clip it to 1, and if it is -1000000, they clip it to -1. Rewards between -1 and 1 are unchanged.

In general, Deep Q-learning algorithms are very unstable with extreme reward values. Since these values are used in backpropagation, the model's parameters are likely to be severely disrupted by large TD errors, making it hard for the algorithm to converge. For this reason, reward or gradient clipping is commonly used.
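As a rough sketch of how that clipping interacts with the example from the question (this is just the common pattern, not DeepMind's actual implementation; the function name and the gamma value are mine):

```python
import numpy as np

def clipped_td_error(reward, q_next_max, q_current, gamma=0.99):
    """Compute a TD error with both the reward and the error clipped to [-1, 1]."""
    reward = np.clip(reward, -1.0, 1.0)          # reward clipping
    td_error = reward + gamma * q_next_max - q_current
    return np.clip(td_error, -1.0, 1.0)          # TD-error clipping

# The +/-1000000 rewards from the question collapse to +/-1,
# so the resulting priorities can no longer dwarf every other sample.
print(clipped_td_error(1_000_000.0, 0.0, 0.0))   # -> 1.0
print(clipped_td_error(-1_000_000.0, 0.0, 0.0))  # -> -1.0
```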

The paper also mentions that greedy prioritisation is sensitive to noise spikes caused by stochastic rewards, and that their method interpolates between greedy prioritisation and uniform sampling. The alpha parameter in equation (1) controls how greedy the sampling is; if stochasticity is causing a problem, lowering it might help. They also discuss rank-based prioritisation in Section 5 as being more robust to error magnitudes and outliers, and suggest that its advantage may be limited in practice because of the "heavy use of clipping".
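To make that interpolation concrete, here is a minimal sketch of the two prioritisation schemes described in the paper (equation (1) is P(i) = p_i^alpha / sum_k p_k^alpha); the epsilon constant, the alpha values and the example numbers are my own illustration, not the authors' code:

```python
import numpy as np

def proportional_probs(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritisation: p_i = |delta_i| + eps, P(i) = p_i^alpha / sum_k p_k^alpha.
    alpha = 0 recovers uniform sampling; alpha = 1 is fully greedy on |TD error|."""
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

def rank_based_probs(td_errors, alpha=0.6):
    """Rank-based prioritisation: p_i = 1 / rank(i), which ignores error magnitudes."""
    ranks = np.empty_like(td_errors)
    order = np.argsort(-np.abs(td_errors))        # rank 1 = largest |TD error|
    ranks[order] = np.arange(1, len(td_errors) + 1)
    p = (1.0 / ranks) ** alpha
    return p / p.sum()

td_errors = np.array([1_000_000.0, -1_000_000.0, 0.5, 0.1, 0.1])
print(proportional_probs(td_errors, alpha=1.0))  # the two outliers get nearly all the mass
print(proportional_probs(td_errors, alpha=0.5))  # still dominated, but less extreme
print(rank_based_probs(td_errors))               # the magnitude of the outliers no longer matters
```

With alpha closer to 0 the sampling approaches uniform, and the rank-based variant caps the influence of any single outlier, which is why it is described as more robust.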

It is also possible that the method is tuned more towards deterministic rewards; the authors note that the environments they tested on (Atari games) were "near-deterministic".

On a broader point, the high variance of the reward suggests there is something to learn in the transition you highlight: it looks as if you can either win or lose the game on the basis of that transition. If that is the case, the algorithm (which does not know whether the game is deterministic or stochastic) will spend a great deal of time trying to learn about that transition. That makes sense if you want to learn to win the game, but in your example the outcome is purely random, so there is nothing to learn.

dilaudid
  • The unclipped reward is just to highlight the problem ... the same idea could be reproduced with many stochastic, clipped transitions, which together would prevent learning from new transitions even though their Q values are already at their true values. – user3548298 Jun 21 '20 at 20:32