I tried to use the following paper (Prioritized Experience Replay) to improve the learning of my agent: https://arxiv.org/pdf/1511.05952.pdf
While it seems to work very well in deterministic environments, I feel like it would actually make learning worse in a stochastic one.
Let's assume that for action A_w at state S_w we get a 50% chance of a reward of +1000000 and a 50% chance of a reward of -1000000 (and a negligible deterministic reward in every other state). Assuming this transition is terminal (or the future return is negligible), the true Q value for that action is just the expected reward, 0.5 * (+1000000) + 0.5 * (-1000000) = 0.
When training on either one of those two samples (assuming both are in the replay memory), its priority will be set to roughly 1000000 (the absolute TD error), so the probability of picking these samples for the upcoming updates will tend to 1 (each of the two oscillating between roughly 0 and 1 as the Q estimate swings back and forth) if we don't add new samples to the replay memory.
The other samples will therefore essentially never be trained on.
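To make the issue concrete, here is a minimal toy sketch I put together (not code from the paper; the numbers, variable names, and simplified proportional update are my own assumptions, and it omits the paper's alpha exponent, epsilon offset, and importance-sampling weights). It keeps a single Q estimate shared by the two +/-1000000 samples, a few ordinary samples with a small fixed error, and counts how often each sample gets picked:

```python
import numpy as np

# Toy simulation of proportional prioritized replay (my own sketch, not the paper's code).
# Assumptions: priority = |TD error|, terminal transitions (target = reward),
# no new samples added, no alpha exponent / epsilon / importance-sampling weights.
rng = np.random.default_rng(0)

rewards = np.array([+1e6, -1e6, 1.0, 1.0, 1.0])  # samples 0 and 1 are the +/-1000000 pair
priorities = np.ones(len(rewards))               # new samples start with equal priority
q = 0.0                                          # single Q(S_w, A_w) estimate shared by samples 0 and 1
alpha = 0.5                                      # learning rate
visits = np.zeros(len(rewards), dtype=int)

for step in range(10_000):
    probs = priorities / priorities.sum()
    i = rng.choice(len(rewards), p=probs)        # sample proportionally to priority
    visits[i] += 1
    if i < 2:                                    # the noisy pair updates the shared Q estimate
        td_error = rewards[i] - q
        q += alpha * td_error
    else:                                        # pretend the ordinary samples keep a small residual error
        td_error = 0.1
    priorities[i] = abs(td_error)                # proportional prioritization

print("visit counts:", visits)
```

With these assumptions, the visit counts end up overwhelmingly concentrated on the first two samples: the Q estimate swings back and forth between the two huge targets, each swing re-inflates the other sample's priority, and the ordinary samples are starved, which is exactly the behaviour I am worried about.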
My question is: how do we deal with that? Should I simply avoid this technique in such an environment?