
In DeepMind's DQN paper, there are two loops: an outer loop over episodes and an inner loop over the time steps within each episode (one for the training episodes and one for the individual time steps of a run). Am I right?

Since nothing happens in the outer loop except initialization and resetting to the conditions of the first step, what difference does the split between them make?

For instance, what differences should we expect between case 1, where we run 1000 episodes of 400 time steps each, and case 2, where we run 4000 episodes of 100 time steps each?

(Is the difference that the second case has a better chance of escaping a local minimum, or something along those lines? Or are the two equivalent?)

Another question: is the experience replay updated at every time step or once per episode, i.e. inside the inner loop or outside it?

[Figure: pseudocode of the DQN algorithm, with the outer episode loop and the inner time-step loop]

Sa Ra

1 Answer


For your first question: the answer is yes, there are two loops, and they do have differences.

You have to think about what an episode really means. In most cases, we can consider each episode a 'game'. A 'game' needs to have an end, and we should do our best to let every game end within the length of an episode (imagine what you could learn if you never got out of a labyrinth). The Q-values in DQN approximate 'current reward' + 'discounted future rewards', and you need to know when the future ends to make a good approximation.
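Written out (a standard restatement, with discount factor $\gamma$ and the episode ending at step $T$):

$$Q(s_t, a_t) \approx r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t}\, r_T$$

so where the episode ends, i.e. the value of $T$, directly shapes what the network is trying to estimate.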

So if the game usually takes about 200 steps to finish, an episode capped at 100 time steps is very different from one capped at 400 time steps.
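To make that concrete, here is a minimal sketch of the two loops. The gym-style `env` and the `agent` with `act`/`remember`/`train_step` methods are hypothetical placeholders, not code from the paper; case 1 corresponds to `num_episodes=1000, max_steps=400`, case 2 to `num_episodes=4000, max_steps=100`.

```python
def run_training(env, agent, num_episodes=1000, max_steps=400):
    """Outer loop over episodes, inner loop over time steps.

    `env` is assumed to follow the classic gym-style API (reset/step);
    `agent` is a hypothetical DQN agent exposing act/remember/train_step.
    """
    for episode in range(num_episodes):        # outer loop: one 'game' per iteration
        state = env.reset()                    # only reset/initialisation happens here
        for t in range(max_steps):             # inner loop: time steps within the game
            action = agent.act(state)          # e.g. epsilon-greedy action from the Q-network
            next_state, reward, done, info = env.step(action)
            agent.remember(state, action, reward, next_state, done)  # store in replay buffer
            agent.train_step()                 # sample a minibatch, one gradient update
            state = next_state
            if done:                           # the game ended before max_steps
                break
```

The outer loop only resets the environment; everything that learns lives in the inner loop, but the cap `max_steps` decides how far into the future a single game is allowed to run.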

As for the experience replay update, it happens at every time step. Beyond that, I don't quite get what you're asking; if you can explain your question in more detail, I think I can answer it.
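As a sketch of what "every time step" means in code (the buffer layout below is just one common way to implement it, not the paper's exact code):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")
replay_buffer = deque(maxlen=100_000)   # oldest experience is evicted automatically
BATCH_SIZE = 32

def store_and_learn(state, action, reward, next_state, done):
    # 1) store the transition -- this runs on EVERY time step (inner loop)
    replay_buffer.append(Transition(state, action, reward, next_state, done))

    # 2) once enough experience has been collected, sample a random minibatch
    #    and run one gradient step on it (also every time step)
    if len(replay_buffer) >= BATCH_SIZE:
        batch = random.sample(replay_buffer, BATCH_SIZE)
        # ... compute targets and update the Q-network on `batch` ...
```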

Kevin Fang
  • Thanks Kevin, but I still don't see the answer about the difference between case 1 and case 2. I need to understand this difference in more detail: which one is more useful? In my problem, the game never finishes; I want to find the minimum cost over the time steps, so I don't think that explains the difference. Regarding replay, I was asking whether the experience replay is updated at each time step or at each episode (inside the inner loop or outside it)? – Sa Ra Aug 09 '18 at 02:17
  • Quick answer to the last question: in each time step. See the 5th and 6th steps of the inner loop in your pic. – Kevin Fang Aug 09 '18 at 02:26
  • For the first question, how do you reward your model? – Kevin Fang Aug 09 '18 at 02:29
  • Based on a cost function of the parameters, used to select the optimal case at each time step. – Sa Ra Aug 09 '18 at 02:42
  • Is it a Markov decision process? Anyway, if your 'game' is an infinite process with no definite 'start' or 'end', then there is no difference between case 1 and case 2. In most DQN applications, however, there are definite start and end conditions, and then case 1 and case 2 can be quite different. – Kevin Fang Aug 09 '18 at 03:24
  • What is their difference then? It's semi-Markov. I want to find the optimum within, say, 100 steps, and I want to apply DQN to do this. – Sa Ra Aug 09 '18 at 07:54
  • If you terminate your episode at step 100, then your model will learn how to get the best reward within 100 steps. From your description, I suppose you want the model to converge to the optimum as soon as possible, preferably within 100 steps, so I think limiting the episode length to 100 would be a good idea. If you set it to 400, it will behave differently in order to reach a better overall reward within 400 steps. – Kevin Fang Aug 09 '18 at 23:54
  • You need to carefully guide your RL model to do what you want it to do. The reward is not necessarily the final goal, but the model will only chase the maximum reward if you tune it well. – Kevin Fang Aug 09 '18 at 23:55
  • Thanks Kevin, the reward is not the only important quantity. The state is a combination of 5 variables, so it is important that these variables track the optimal answer (I know the optimal answer in a special case). What is still unclear to me is why you said more episodes are better than more time steps. I thought the opposite, because if training finds network weights that are optimal or close to optimal over longer time steps, that demonstrates the accuracy of the algorithm more convincingly. What do you think? – Sa Ra Aug 10 '18 at 06:54
  • The RL model does not know what your 'optimal' is; all it knows is the reward. So your model will only head for the best reward, whether that corresponds to the optimal strategy or not. – Kevin Fang Aug 10 '18 at 07:01
  • No, I think there is a misunderstanding. Minimizing the reward leads to the optimum, so if the answer tracks the optimum, it shows that RL performs as well as or better than the optimal solution, and that is a good result. – Sa Ra Aug 10 '18 at 07:02
  • If you want your model to converge to the 'optimal' strategy, you have to guide it so that 1) the maximum reward can be achieved by the optimal strategy and only by the optimal strategy, and 2) your model has a long way to go to reach that optimal strategy, so you have to make sure nothing stops it along the way. – Kevin Fang Aug 10 '18 at 07:03
  • Another question: is it reasonable to expect a DQN whose cost function is c1*x1 + c2*x2 + c3*x3 to be robust if c1, c2, c3 are changed, or if one of the constraints is changed? I think not; at the very least, training to update the DNN weights is needed, or at least transfer learning to find the new DNN weights. What do you think? – Sa Ra Aug 10 '18 at 07:05
  • Actually, DQN generally uses the MSE between the 'model's predicted Q-values' and the 'target Q-values', where the target Q-values come from your target Q-network and are updated from the observations. This loss is convex and learnable, while your linear cost function does not look convex to me, so I don't think it is a good loss function. – Kevin Fang Aug 10 '18 at 07:09
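For concreteness, here is a minimal sketch of the loss described in the last comment, assuming PyTorch and hypothetical `online_net`/`target_net` networks that map a batch of states to per-action Q-values (the tensor names are placeholders):

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    # batch tensors: states [B, obs_dim], actions [B] (long), rewards [B],
    # next_states [B, obs_dim], dones [B] (1.0 where the episode ended)
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) predicted by the online network for the actions actually taken
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # bootstrapped target: r + gamma * max_a' Q_target(s', a'), cut off at episode end
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)

    return F.mse_loss(q_pred, q_target)
```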