
I'm trying to read through the PPO1 code in OpenAI's Baselines implementation of RL algorithms (https://github.com/openai/baselines) to gain a better understanding of how PPO works, how one might go about implementing it, etc.

I'm confused as to the difference between the "optim_batchsize" and the "timesteps_per_actorbatch" arguments that are fed into the "learn()" function. What are these hyper-parameters?

In addition, I see in the "run_atari.py" file that the "make_atari" and "wrap_deepmind" functions are used to wrap the environment. Among the wrappers they apply is "EpisodicLifeEnv", which ends the episode once a life is lost. On average, I see that the episode length at the beginning of training is about 7 - 8 timesteps, but the batch size is 256, so I don't see how any updates can occur. Thanks in advance for your help.
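For context, my understanding of the wrapping done in "run_atari.py" is roughly the following (paraphrased, not the exact code from the repo; the environment id is just an example):

    # Paraphrased from ppo1/run_atari.py -- not the exact code.
    from baselines.common.atari_wrappers import make_atari, wrap_deepmind

    env = make_atari('BreakoutNoFrameskip-v4')  # no-op resets, frame skipping
    env = wrap_deepmind(env)                    # EpisodicLifeEnv, frame warping, reward clipping, ...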

ashboy64

1 Answer


I've been going through it on my own as well... their code is a nightmare! optim_batchsize is the minibatch size used when optimizing the policy; timesteps_per_actorbatch is the number of time steps the agent runs in the environment before each round of optimization.
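Roughly, the way I read the relationship is the following (a sketch, not the actual baselines code; the rollout array here is just placeholder data):

    import numpy as np

    timesteps_per_actorbatch = 256   # transitions gathered from the env per outer iteration
    optim_batchsize = 64             # minibatch size used inside the optimization loop
    optim_epochs = 4                 # passes over the collected batch

    # Placeholder rollout: one row per collected timestep.
    rollout = np.random.randn(timesteps_per_actorbatch, 8)

    for epoch in range(optim_epochs):
        order = np.random.permutation(timesteps_per_actorbatch)
        for start in range(0, timesteps_per_actorbatch, optim_batchsize):
            minibatch = rollout[order[start:start + optim_batchsize]]
            # ...one gradient step on `minibatch` would happen here...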

On the episodic thing, I am not sure. I see two ways it could happen: either the agent keeps collecting until the 256 entries are filled before actually updating (so one batch spans several episodes), or the batch is padded with dummy data that does nothing, effectively only updating on the 7 or 8 steps that the episode lasted. A sketch of the first possibility is below.
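If it is the first case, the collection loop would look something like this (a hypothetical sketch under the old gym step/reset API, with a random policy and a stand-in environment, not the repo's code):

    import gym

    env = gym.make('CartPole-v1')   # stand-in environment
    horizon = 256                   # timesteps_per_actorbatch
    transitions = []

    obs = env.reset()
    while len(transitions) < horizon:
        action = env.action_space.sample()             # placeholder for the learned policy
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, done))
        obs = env.reset() if done else next_obs        # start a new episode mid-batch

    # transitions now spans however many short episodes it took to reach 256 steps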

SymX