There is a simple formula that always holds for on-policy algorithms in Stable Baselines:
n_updates = total_timesteps // (n_steps * n_envs)
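To make this concrete, here is the arithmetic with made-up numbers (the values below are purely illustrative, not a recommended configuration):

    # Purely illustrative numbers, not a recommended configuration.
    total_timesteps = 1_000_000
    n_envs = 8
    n_steps = 125  # transitions collected per environment per rollout

    rollout_size = n_steps * n_envs               # 1_000 transitions per update
    n_updates = total_timesteps // rollout_size   # 1_000 policy updates
    print(n_updates)                              # -> 1000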
From this formula it follows that n_steps is the number of experiences collected from a single environment under the current policy before the next update. My subjective rule of thumb is to set this value equal to the episode length, especially if there is a terminal reward.
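A minimal sketch of that practice (the env choice and the episode length of 500 for CartPole-v1 are only for illustration; batch_size is set so it divides the buffer evenly, more on that below):

    # Minimal sketch: n_steps set to the episode length.
    # CartPole-v1 episodes are truncated at 500 steps; env choice is illustrative.
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    episode_length = 500
    n_envs = 4

    env = make_vec_env("CartPole-v1", n_envs=n_envs)
    model = PPO(
        "MlpPolicy",
        env,
        n_steps=episode_length,      # one full episode per env before each update
        batch_size=episode_length,   # divides the 2000-transition buffer evenly
    )
    model.learn(total_timesteps=50_000)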
Then, there is a slight misuse of terminology in SB3. A batch for PPO is actually the whole rollout buffer, whose size equals n_steps * n_envs.
What batch_size refers to is in fact the minibatch size: a randomly shuffled subset of the buffer (the batch). Lots of people set batch_size = n_steps, so that the networks consume the whole buffer in one pass (at least when n_envs = 1), which may be an option when you have enough video memory and use population-based training. The standard practice, however, is to use smaller minibatches such that the buffer size (n_steps * n_envs) is divisible by batch_size, as in the default PPO parameters in SB3.
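As a sketch of that bookkeeping with SB3's default PPO hyperparameters (n_steps=2048, batch_size=64, n_epochs=10; the Pendulum-v1 env id is only an example, and I am relying on the model exposing n_steps, batch_size and n_epochs as attributes):

    # Sketch of the buffer/minibatch arithmetic with SB3's default PPO settings.
    from stable_baselines3 import PPO

    model = PPO("MlpPolicy", "Pendulum-v1")  # defaults: n_steps=2048, batch_size=64, n_epochs=10

    n_envs = model.env.num_envs                        # 1 here, since a single env id was passed
    buffer_size = model.n_steps * n_envs               # 2048 transitions per rollout (the "batch")
    minibatches_per_epoch = buffer_size // model.batch_size         # 2048 / 64 = 32 minibatches
    grad_steps_per_rollout = minibatches_per_epoch * model.n_epochs  # 32 * 10 = 320 gradient steps
    print(buffer_size, minibatches_per_epoch, grad_steps_per_rollout)

So each collected buffer of 2048 transitions is reshuffled and split into 32 minibatches of 64, and that is repeated for n_epochs passes before the next rollout is collected.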