In DeepMind's DQN paper, there are two nested loops: an outer loop over episodes and an inner loop over the time steps within each episode (one for training and one for the individual time steps of a run). Am I right?
Since nothing is done in the outer loop except initialization and resetting to the conditions of the first step, what is the difference between the two loops?
For instance, if in case 1 we run 1000 episodes of 400 time steps each, what differences should we expect in case 2, where we run 4000 episodes of 100 time steps each?
(Is the difference that the second case has a better chance of escaping a local minimum, or something similar? Or are the two cases the same?)
Another question: where in the algorithm is the experience replay memory updated?
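
To make the question concrete, here is a minimal Python sketch of the two-loop structure as I understand it from the paper's algorithm. It is only an illustration under assumptions: `run_loops` and its counters are hypothetical names, the environment and the Q-network are replaced by trivial stand-ins, and the default capacity and batch size are arbitrary. The one structural point it relies on is that the replay memory is created once before the episode loop and that a transition is stored at every inner time step.

```python
import random
from collections import deque


def run_loops(num_episodes, steps_per_episode, capacity=10_000, batch_size=32, seed=0):
    """Count what each loop does in a DQN-style training skeleton (hypothetical helper)."""
    rng = random.Random(seed)
    replay = deque(maxlen=capacity)  # created once, before the episode loop
    resets = stored = updates = 0

    for _ in range(num_episodes):              # outer loop: episodes
        state = 0                              # stand-in for resetting the environment
        resets += 1
        for _ in range(steps_per_episode):     # inner loop: time steps
            action = rng.randrange(2)          # stand-in for the epsilon-greedy choice
            next_state = state + 1             # stand-in for stepping the environment
            reward = 0.0
            # The replay memory is updated here, once per time step.
            replay.append((state, action, reward, next_state))
            stored += 1
            if len(replay) >= batch_size:
                # A minibatch would feed one gradient step on the Q-network.
                batch = rng.sample(replay, batch_size)
                updates += 1
            state = next_state

    return {"resets": resets, "stored": stored, "updates": updates}


if __name__ == "__main__":
    # Scaled-down versions of case 1 (1000 x 400) and case 2 (4000 x 100).
    print(run_loops(10, 40))
    print(run_loops(40, 10))
```

In this skeleton both cases store the same number of transitions (1000 x 400 = 4000 x 100 = 400,000) and would perform the same number of minibatch updates; the only thing the counters distinguish is how often the environment is reset. Whether that difference matters for learning is exactly what I am asking.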