I'm trying to read through the PPO1 code in OpenAi's Baselines implementation of RL algorithms (https://github.com/openai/baselines) to gain a better understanding as to how PPO works, how one might go about implementing it, etc.
I'm confused as to the difference between the "optim_batchsize" and the "timesteps_per_actorbatch" arguments that are fed into the "learn()" function. What are these hyper-parameters?
In addition, I see in the "run_atari.py" file, the "make_atari" and "wrap_deepmind" functions are used to wrap the environment. In the "make_atari" function, it uses the "EpisodicLifeEnv", which ends the episode once the a life is lost. On average, I see that the episode length in the beginning of training is about 7 - 8 timesteps, but the batch size is 256, so I don't see how any updates can occur. Thanks in advance for your help.