
I don't understand the following:

  1. env.close() #what does this mean?

  2. model.learn(total_timesteps=1000) # is total_timesteps here the number of steps after which the neural network model's parameters are updated (i.e., the number of time-steps per episode)?

  3. model = PPO(MlpPolicy, env, verbose=1) # what is the meaning of verbose=1 here?

  4. action, _state = model.predict(obs, deterministic=True) # what is deterministic=True doing here? Does deterministic=True mean that the policy is deterministic and not stochastic?

  5. Where can I state the number of episodes for which I want to run my experiment?

    for i in range(1000):
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()

Is 1000 here the number of episodes?

Could someone please clarify these points?

D.g

1 Answer
  1. env.close() is dependent on the environment, so it will do different things for each one. It is basically used to stop rendering the game, as seen in the code here:
    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False
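
Just to show where env.close() typically sits, here is a minimal sketch assuming the classic CartPole-v1 environment and the old gym step API (the environment ID and loop length are only illustrative):

    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()
    for _ in range(200):                    # a few random steps, just for illustration
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        env.render()                        # opens the pygame window for CartPole
        if done:
            obs = env.reset()
    env.close()                             # closes that window again via close() above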
  2. total_timesteps is the total number of timesteps you'd like to train your agent for. n_steps is the parameter that decides how often the model is updated: PPO collects n_steps steps per environment, then performs an update. Feel free to look at the documentation if you're still confused.
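
As a rough sketch of how the two interact (the environment ID and the numbers here are just placeholders):

    from stable_baselines3 import PPO

    # n_steps: how many steps are collected per environment before each policy update
    # total_timesteps: the overall training budget, spread across all updates
    model = PPO("MlpPolicy", "CartPole-v1", n_steps=256, verbose=1)
    model.learn(total_timesteps=10_000)   # with a single env: roughly 10_000 / 256 ≈ 39 updates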
  3. From the documentation, verbosity is how much is printed out about the model at each update:
verbose (int) – the verbosity level: 0 no output, 1 info, 2 debug
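
For example (just to illustrate the flag, using a placeholder environment):

    from stable_baselines3 import PPO

    PPO("MlpPolicy", "CartPole-v1", verbose=0).learn(total_timesteps=5_000)  # trains silently
    PPO("MlpPolicy", "CartPole-v1", verbose=1).learn(total_timesteps=5_000)  # prints rollout/train stats each update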
  4. No. deterministic=True applies only to that particular predict() call: you could call predict() again immediately afterward with deterministic=False and get a stochastic action, so the model itself is neither stochastic nor deterministic. PPO, in particular, gets actions by first feeding an observation to the actor's network, which outputs what are called logits, or unnormalized log action probabilities. Those logits are then passed through a Categorical distribution, which is essentially a Softmax() operation, to get the action probability distribution. A simple pseudocode example to get actions from the policy's network would be as follows:
from torch.distributions import Categorical  # assuming a PyTorch policy network

logits = policy_network(observation)  # unnormalized log action probabilities
probs = Categorical(logits=logits)    # softmax over the logits -> action distribution
actions = probs.sample()              # draw a stochastic action

As you can see from this code:

    def get_actions(self, deterministic: bool = False) -> th.Tensor:
        """
        Return actions according to the probability distribution.
        :param deterministic:
        :return:
        """
        if deterministic:
            return self.mode()
        return self.sample()

Stable Baselines uses the deterministic input you mentioned to either call the Categorical distribution's mode() function or its sample() function. The code for both is in PyTorch's documentation:

    def sample(self, sample_shape=torch.Size()):
        if not isinstance(sample_shape, torch.Size):
            sample_shape = torch.Size(sample_shape)
        probs_2d = self.probs.reshape(-1, self._num_events)
        samples_2d = torch.multinomial(probs_2d, sample_shape.numel(), True).T
        return samples_2d.reshape(self._extended_shape(sample_shape))
    def mode(self):
        return self.probs.argmax(axis=-1)

As you can see, the Categorical distribution's sample() function just calls PyTorch's torch.multinomial, which returns a random sample from a multinomial distribution; that sampling is what makes your actions stochastic when deterministic=False. On the other hand, the Categorical distribution's mode() function just performs an argmax() operation, which has no randomness and is therefore deterministic. Hopefully that explanation was not too complicated.
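
To see the difference in practice, here is a small sketch (assuming CartPole-v1 and a very short training run, both just for illustration): calling predict() repeatedly on the same observation gives varying actions with deterministic=False, but always the same action with deterministic=True.

    from stable_baselines3 import PPO

    model = PPO("MlpPolicy", "CartPole-v1").learn(total_timesteps=2_000)
    obs = model.get_env().reset()

    # Same observation every time, different sampling behaviour:
    stochastic = [model.predict(obs, deterministic=False)[0] for _ in range(5)]  # sampled, may vary
    greedy = [model.predict(obs, deterministic=True)[0] for _ in range(5)]       # argmax, all identical
    print(stochastic, greedy)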

  5. This isn't something that can be done by simply passing a parameter; you need to use the StopTrainingOnMaxEpisodes callback, as per the documentation. There is also a simple code example in the documentation that I will just echo here for clarity:
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes

# Stops training when the model reaches the maximum number of episodes
callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=5, verbose=1)

model = A2C('MlpPolicy', 'Pendulum-v1', verbose=1)
# Almost infinite number of timesteps, but the training will stop
# early as soon as the max number of episodes is reached
model.learn(int(1e10), callback=callback_max_episodes)
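
The same callback plugs into the PPO setup from your question; a minimal sketch (the environment ID and max_episodes value are placeholders):

    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes

    callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=10, verbose=1)
    model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
    model.learn(total_timesteps=int(1e10), callback=callback_max_episodes)  # stops once ~10 episodes are done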

If you're confused at all about how PPO works or want more questions answered about that, I highly highly HIGHLY suggest reading this article, which fully explains how PPO is actually implemented in code, links all the papers that explain the intuition behind how PPO was created, and comes with extremely helpful videos.