
The A3C algorithm (and N-step Q-learning) updates the globally shared network once every N timesteps. N is usually fairly small, 5 or 20 as far as I remember.
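For reference, here is a rough sketch of the per-worker N-step loop as I understand it from the paper; `net`, `env`, `select_action` and `apply_gradients` are placeholder names of mine (and the `env.step` signature is simplified), not any particular framework's API:

    # Rough sketch of one A3C worker's N-step update loop (placeholder names).
    N = 5
    GAMMA = 0.99

    state = env.reset()
    done = False
    while not done:
        states, actions, rewards = [], [], []
        # Roll out at most N timesteps before touching the shared network.
        for _ in range(N):
            action = select_action(net, state)
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state
            if done:
                break
        # Bootstrap from the critic unless the episode just ended.
        R = 0.0 if done else net.value(state)
        # Walk backwards through the batch, accumulating gradients.
        for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
            R = r + GAMMA * R
            net.accumulate_gradients(s, a, R)
        apply_gradients(net)  # asynchronous update of the globally shared weights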

Wouldn't it be possible to set N to infinity, meaning that the networks are only trained at the end of an episode? I am not arguing that it is necessarily better - though to me it sounds like it could be - but at least it should not be much worse, right?

Losing the asynchronous training based on multiple agents exploring different environments in parallel (and with it the stabilization of the training procedure without a replay memory) might be a problem if the training is done sequentially (as in: for each worker thread, train the network on the whole observed state-action-reward (SAR) sequence). The training could still be done asynchronously on sub-sequences, though; that would only make training with stateful LSTMs a little more complicated.
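Something like this is what I have in mind for the sub-sequence case (again just a sketch; `lstm_net.initial_state` and `train_on_subsequence` are hypothetical helpers):

    # Train on an already collected full episode in fixed-size sub-sequences,
    # carrying the LSTM state from one chunk to the next (hypothetical helpers).
    CHUNK = 20

    def train_episode_in_chunks(lstm_net, episode):
        """`episode` is the full list of (state, action, reward) tuples."""
        hidden = lstm_net.initial_state()
        for start in range(0, len(episode), CHUNK):
            chunk = episode[start:start + CHUNK]
            # The returned hidden state is fed into the next chunk, so the
            # recurrent network still sees the episode as one long sequence.
            hidden = lstm_net.train_on_subsequence(chunk, hidden)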

The reason I am asking is the "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" paper. To compare it with algorithms like A3C, it would make more sense - from a code engineering point of view - to train both algorithms in the same episodic way.

Another Coder

1 Answer


Definitely: just set N to be larger than the maximum episode length (or modify the source to remove the batching condition). Note that in the original A3C paper this is done for the dynamical control environments (with continuous action spaces), with good results. It is commonly argued that being able to update mid-episode (even though it is not required) is a key advantage of TD methods: it exploits the Markov property of the MDP.
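For illustration, here is a sketch of what either option looks like in a worker's rollout loop; this is not the actual A3C source, just placeholder code with made-up names:

    # Option 1: make t_max effectively infinite; option 2: drop the length
    # condition so the batch is only flushed when the episode terminates.
    T_MAX = 10**9  # larger than any episode you will ever see

    def collect_batch(env, net, state, t_max=T_MAX):
        batch, done = [], False
        while not done and len(batch) < t_max:  # or simply `while not done:`
            action = net.sample_action(state)
            next_state, reward, done = env.step(action)
            batch.append((state, action, reward))
            state = next_state
        return batch, state, done  # one update per episode if t_max is never hit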

Falcon
  • Thanks for the input :-) I guess you refer to this part of the paper: "[...] since the episodes were typically at most several hundred time steps long we did not use any bootstrapping in the policy or value function updates and batched each episode into a single update"? This indeed reads as if the mid-episode updating is more of an efficiency thing. Good to know. But then I don't understand the last sentence of your reply. Are you saying that mid-episode updating is an advantage in terms of efficiency ("nice to have"), but not in terms of final fitness? Thanks!! – Another Coder Mar 21 '17 at 11:49
  • Yes, data efficiency (you update more frequently, trading off variance since the updates are based on less data). In the case of a non-episodic task, i.e. an episode of infinite length, you have to perform updates mid-episode. There is also a legitimate concern about bias due to the reward discount (gamma < 1): mid-episode updates effectively place a higher gamma on late rewards than full-episode updates. – Falcon Mar 22 '17 at 18:21
  • Thanks again :-) I hope you don't mind if I ask another one. So you say that there is a difference in bias, but if there are, for example, 20 steps in that episode and N is 5, then: in the mid-episode case I would update 4 times with 5 SAR pairs each, and in the episodic case I would update once with 20 SAR pairs. So shouldn't the only difference be that I do not bootstrap the final reward from the value part of the network in the episodic case, but instead use the "real" final value? Or is that exactly the difference that introduces the bias? – Another Coder Mar 23 '17 at 08:32
  • No, I meant that if you are using a gamma-discounted reward in your training, then during each mid-episode update you are weighting the rewards in those 5 frames as if you had started at the first frame of that 5-frame batch, instead of the first frame of the 20-frame episode. This effectively over-weights the later frames (see the sketch after these comments). – Falcon Mar 24 '17 at 04:31
  • Thank you very much, I think I understand now :-) – Another Coder Mar 24 '17 at 10:35
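A small numeric sketch of the two targets being compared in the comments above (a toy 20-step episode with placeholder values, not taken from any implementation):

    # Compare the update target for the first state s_0 of a toy 20-step
    # episode: full-episode (Monte Carlo) return vs. 5-step return with a
    # bootstrapped value estimate V(s_5).  Placeholder code only.
    GAMMA = 0.99
    rewards = [1.0] * 20  # constant reward, 20-step episode

    # Full-episode target: every reward is discounted relative to the start
    # of the episode.
    full_return = sum(GAMMA ** t * r for t, r in enumerate(rewards))

    # 5-step target: only the first 5 rewards are discounted directly; the
    # rest of the episode enters through the bootstrapped estimate V(s_5).
    def n_step_target(rewards, v_bootstrap, n=5, gamma=GAMMA):
        return sum(gamma ** t * r for t, r in enumerate(rewards[:n])) + gamma ** n * v_bootstrap

    v5_estimate = 10.0  # whatever the critic currently predicts for s_5
    print(full_return, n_step_target(rewards, v5_estimate))

For this particular state's target, the gap between the two versions is gamma^5 times the gap between V(s_5) and the true remaining discounted return, which is where the bootstrapping enters the mid-episode update.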