
In Q-learning with experience replay, the agent keeps a data structure that stores past experiences used for training (a basic example would be a list of tuples in Python). For a complex state space, I would need to train the agent on a very large number of different situations to obtain a neural network that correctly approximates the Q-values. The experience data would occupy more and more memory, so I should impose an upper limit on the number of experiences stored, beyond which experiences are dropped from memory.

Would FIFO (first in, first out) be a good policy for evicting experiences from the agent's memory? That way, once the memory limit is reached I would discard the oldest experiences, which might help the agent adapt more quickly to changes in the environment. Also, how could I compute a good maximum number of experiences to keep in memory so that Q-learning on the agent's neural network converges to the Q-function approximator I need? I know this could be tuned empirically; I would like to know whether an analytical estimate of this limit exists.
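For concreteness, here is a minimal sketch of such a replay memory in Python, with experiences stored as tuples as described above. The names (ReplayMemory, capacity, push, sample) are illustrative, not from any particular library; a deque with maxlen gives exactly the FIFO eviction asked about, since appending to a full buffer silently drops the oldest entry.

    import random
    from collections import deque, namedtuple

    # One experience is a plain tuple, as in the question.
    Experience = namedtuple(
        "Experience", ["state", "action", "reward", "next_state", "done"]
    )

    class ReplayMemory:
        def __init__(self, capacity):
            # maxlen makes the deque FIFO: appending to a full buffer
            # automatically discards the oldest experience.
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.buffer.append(Experience(state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniform random minibatch for the Q-network update.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)

    # Usage (the capacity value is exactly the limit the question asks about):
    # memory = ReplayMemory(capacity=100_000)
    # memory.push(s, a, r, s_next, done)
    # batch = memory.sample(32)   # once len(memory) >= 32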

codroneci
  • DeepMind handled experience replay by sampling at random (emulating biological recall) (http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html). You could use a FIFO store/delete approach and train on a random subset at each learning step; the random sampling does not need anything fancier. It may also be interesting to use your reward value as a sorting weight and drop your lowest-reward data (a sketch of this idea follows these comments), though you may introduce overfitting if your data set is too small or too localized. – dblclik Jun 27 '18 at 19:16
  • I recommend you move your question to CrossValidated (stats.stackexchange.com) as you'll get more interest and reception – dblclik Jun 28 '18 at 12:24
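As a sketch of the reward-weighted idea from the first comment: keep the buffer as a min-heap keyed on reward, so that once the buffer is full the lowest-reward experience is the one that gets dropped. This is only an illustration (the class name RewardWeightedMemory is made up here), and, as the comment warns, it can bias the buffer and encourage overfitting.

    import heapq
    import itertools
    import random

    class RewardWeightedMemory:
        """Keeps the highest-reward experiences; once full, the lowest-reward
        experience is dropped (or the new one is ignored if its reward is
        even lower)."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.heap = []                    # min-heap of (reward, counter, experience)
            self.counter = itertools.count()  # tie-breaker so experiences are never compared

        def push(self, experience, reward):
            item = (reward, next(self.counter), experience)
            if len(self.heap) < self.capacity:
                heapq.heappush(self.heap, item)
            elif reward > self.heap[0][0]:
                # New experience beats the current minimum: swap them.
                heapq.heapreplace(self.heap, item)

        def sample(self, batch_size):
            # Training minibatches are still drawn uniformly at random.
            return [exp for _, _, exp in random.sample(self.heap, batch_size)]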

1 Answer


In the seminal paper on deep reinforcement learning (the Nature DQN paper linked in the comments), DeepMind achieved their results by training on minibatches of experiences sampled uniformly at random from the replay memory; experiences beyond the memory's fixed capacity were simply dropped.

It's hard to say how a FIFO approach would affect your results without knowing more about the problem you're trying to solve. As dblclik points out, a small or too-localized buffer may cause your learning agent to overfit. That said, it's worth trying. There may well be cases where using FIFO to keep the experience replay saturated leads to faster learning. I would try both eviction policies, FIFO and dropping stored experiences at random, and see whether your agent reaches convergence more quickly with one of them.
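To make the second policy concrete, here is one possible random-replacement buffer: once full, it overwrites a uniformly random slot instead of always evicting the oldest experience. This is only an illustrative sketch (the class name RandomEvictionMemory is invented here), not the scheme from the DQN paper, which kept the most recent experiences.

    import random

    class RandomEvictionMemory:
        """Replay buffer that, once full, overwrites a uniformly random slot
        instead of always evicting the oldest experience (as FIFO would)."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.buffer = []

        def push(self, experience):
            if len(self.buffer) < self.capacity:
                self.buffer.append(experience)
            else:
                # Evict a random stored experience, so old and new data mix over time.
                self.buffer[random.randrange(self.capacity)] = experience

        def sample(self, batch_size):
            # Minibatches are still drawn uniformly at random for training.
            return random.sample(self.buffer, batch_size)

Swapping this class for the FIFO buffer sketched under the question, while keeping the rest of the training loop fixed, is one simple way to run the comparison suggested above.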

R.F. Nelson