
I have read the DQN paper.

While reading the DQN paper, I found that randomly sampling experiences and learning from them reduces divergence in RL when using a non-linear function approximator.

If so, why does RL with a non-linear function approximator diverge when the input data are strongly correlated?

강문주
  • This is off-topic, it's not a programming question. – Dr. Snoopy Jan 28 '20 at 07:35
  • This question should be asked in [ai stack](https://ai.stackexchange.com/) or [stats stack](https://stats.stackexchange.com/) – hola Jan 28 '20 at 10:23
  • Thank you, guys. This is the first time I have asked a question. Thank you for your kind explanation. – 강문주 Feb 03 '20 at 03:08
  • @강문주 If my answer below has solved your question, please consider [accepting it](https://meta.stackexchange.com/q/5234/179419) by clicking the check mark next to it. This indicates to the wider community that you've found a solution. – Brett Daley Feb 03 '20 at 03:26

1 Answer


I believe that Section X (starting on page 687) of An Analysis of Temporal-Difference Learning with Function Approximation provides an answer to your question. In summary, there exist nonlinear functions whose average prediction error actually increases after applying the TD(0) Bellman operator; hence, the value estimates will eventually diverge. This is generally the case for deep neural networks because they are inherently nonlinear and tend to be poorly behaved from an optimization perspective.
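To make the mechanics concrete, here is a minimal sketch of semi-gradient TD(0) with a small nonlinear value network trained on strongly correlated, consecutive states. Everything here (the toy network, the random-walk states, the reward, the step size) is an arbitrary illustration of the update rule, not the construction from the paper:

```python
# Semi-gradient TD(0) with a tiny nonlinear value network (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer value network: V(s) = w2 . tanh(W1 s + b1) + b2
W1, b1 = rng.normal(size=(8, 2)) * 0.5, np.zeros(8)
w2, b2 = rng.normal(size=8) * 0.5, 0.0

def value_and_grads(s):
    h = np.tanh(W1 @ s + b1)            # hidden activations
    v = w2 @ h + b2                     # scalar value estimate
    dh = 1.0 - h**2                     # tanh derivative
    grads = (np.outer(w2 * dh, s),      # dV/dW1
             w2 * dh,                   # dV/db1
             h,                         # dV/dw2
             1.0)                       # dV/db2
    return v, grads

gamma, alpha = 0.99, 0.05
s = rng.normal(size=2)                  # arbitrary start state

for t in range(1000):
    # Consecutive states are strongly correlated: s' is a small perturbation of s.
    s_next = s + 0.1 * rng.normal(size=2)
    r = float(s[0])                     # arbitrary reward, for illustration only

    v, (gW1, gb1, gw2, gb2) = value_and_grads(s)
    v_next, _ = value_and_grads(s_next)
    delta = r + gamma * v_next - v      # TD(0) error

    # Semi-gradient update: only V(s) is differentiated; the target is held fixed.
    W1 += alpha * delta * gW1
    b1 += alpha * delta * gb1
    w2 += alpha * delta * gw2
    b2 += alpha * delta * gb2

    s = s_next
```

Because each update is applied along a highly correlated trajectory, nothing guarantees the updates behave like a contraction once the approximator is nonlinear, which is the failure mode the cited analysis formalizes.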

Additionally, training on independent and identically distributed (i.i.d.) data makes it possible to compute unbiased estimates of the gradient, which first-order optimization algorithms like Stochastic Gradient Descent (SGD) require in order to converge to a local minimum of the loss function. This is why DQN samples random minibatches from a large replay memory and then reduces the loss using RMSProp (an adaptive variant of SGD).
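As an illustration of that second point, here is a minimal replay-memory sketch with uniform random minibatch sampling; the class name, capacity, and batch size are my own choices, not DQN's actual implementation:

```python
# Minimal replay memory with uniform random minibatch sampling (sketch only).
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, giving approximately i.i.d. minibatches.
        return random.sample(self.buffer, batch_size)

# Usage sketch: push transitions as the agent acts, then sample minibatches to
# compute the loss and apply an RMSProp/SGD step on the Q-network.
# memory = ReplayMemory()
# memory.push(s, a, r, s_next, done)
# batch = memory.sample(32)
```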

Brett Daley