- env.close() is environment-dependent, so it does different things for each environment. It is essentially used to stop rendering the game and release the display resources, as you can see in the code here:
def close(self):
    if self.screen is not None:
        import pygame
        pygame.display.quit()
        pygame.quit()
        self.isopen = False
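For context, here is a minimal usage sketch (assuming the classic gym API and CartPole purely for illustration; newer Gymnasium versions return a 5-tuple from step(), but close() is used the same way):
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
for _ in range(200):
    env.render()                                    # opens the pygame window
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
env.close()                                         # quits pygame and closes the window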
- total_timesteps is the total number of timesteps you'd like to train your agent for. n_steps is the parameter that decides how often the model is updated: it is the number of steps collected per environment before each update. Feel free to look at the documentation if you're still confused; there is also a small sketch below.
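A minimal sketch of how the two parameters are used (the environment and the specific numbers are just placeholders):
from stable_baselines3 import PPO

# n_steps: how many steps are collected per environment before each policy update
model = PPO("MlpPolicy", "CartPole-v1", n_steps=2048)

# total_timesteps: how many environment steps to train for in total
model.learn(total_timesteps=100_000)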
- From the documentation, verbosity controls how much information is printed out about the model at each update:
verbose (int) – the verbosity level: 0 no output, 1 info, 2 debug
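A quick sketch of setting it (the model and environment choice is arbitrary):
from stable_baselines3 import PPO

# verbose=0: silent, verbose=1: info printed during training, verbose=2: debug messages
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)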
- No. Since you are only calling the model's predict() function, the question doesn't really make sense: you could call predict() again immediately afterward with deterministic=False and get a stochastic action. The model itself is neither stochastic nor deterministic. PPO, in particular, gets actions by first feeding an observation into the actor's network, which outputs what are called logits, i.e. unnormalized log action probabilities. Those logits are then passed through a Categorical distribution, which is essentially a Softmax() operation, to get the action probability distribution. In pseudocode, getting actions from the policy's network looks like this:
logits = policy_network(observation)  # unnormalized log action probabilities
probs = Categorical(logits=logits)    # softmax over the logits -> action distribution
actions = probs.sample()              # draw an action from that distribution
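The same idea as a runnable PyTorch sketch (the network, its sizes, and the observation are made up purely for illustration):
import torch
import torch.nn as nn
from torch.distributions import Categorical

policy_network = nn.Linear(4, 2)                   # toy actor: 4-dim observation -> logits for 2 actions
observation = torch.randn(1, 4)                    # fake observation batch of size 1

logits = policy_network(observation)               # unnormalized log action probabilities
dist = Categorical(logits=logits)                  # softmax over logits -> action distribution
stochastic_action = dist.sample()                  # what deterministic=False ends up doing
deterministic_action = dist.probs.argmax(dim=-1)   # what deterministic=True ends up doing (the mode)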
As you can see from Stable Baselines' code:
def get_actions(self, deterministic: bool = False) -> th.Tensor:
    """
    Return actions according to the probability distribution.
    :param deterministic:
    :return:
    """
    if deterministic:
        return self.mode()
    return self.sample()
Stable Baselines uses the deterministic argument you mentioned to call either the Categorical distribution's mode() function or its sample() function. The code for both is in PyTorch's documentation:
def sample(self, sample_shape=torch.Size()):
    if not isinstance(sample_shape, torch.Size):
        sample_shape = torch.Size(sample_shape)
    probs_2d = self.probs.reshape(-1, self._num_events)
    samples_2d = torch.multinomial(probs_2d, sample_shape.numel(), True).T
    return samples_2d.reshape(self._extended_shape(sample_shape))

def mode(self):
    return self.probs.argmax(axis=-1)
As you can see, the Categorical distribution's sample() function just calls PyTorch's torch.multinomial function, which draws a random sample from a multinomial distribution; that sampling is what makes your actions stochastic when deterministic=False. On the other hand, the Categorical distribution's mode() function just performs an argmax() operation, which has no randomness and is therefore deterministic. Hopefully that explanation was not too complicated.
- This isn't something that can be done simply by passing a parameter; you need to use the built-in StopTrainingOnMaxEpisodes callback, as per the documentation. There is also a simple code example in the documentation that I will just echo here for clarity:
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes
# Stops training when the model reaches the maximum number of episodes
callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=5, verbose=1)
model = A2C('MlpPolicy', 'Pendulum-v1', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the max number of episodes is reached
model.learn(int(1e10), callback=callback_max_episodes)
If you're at all confused about how PPO works, or want more questions about it answered, I highly highly HIGHLY suggest reading this article, which fully explains how PPO is actually implemented in code, links all the papers that explain the intuition behind how PPO was created, and comes with extremely helpful videos.