
I'm new to RL and I was hoping to get some advice from you all:

I created a custom environment that is a 10x10 grid world where the agent and its target destination (as well as some obstacles, namely fires) can be randomly placed. The state of the environment that the model is trained on is just a Box numpy array representing the different positions (0 for empty spaces, 1 for the target, etc.).
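Concretely, the observation space is declared along these lines (a simplified sketch using the Gymnasium-style API; the exact cell codes for fires and the agent are placeholders here):

```
from gymnasium import spaces
import numpy as np

# one integer code per cell; the codes above 1 (fire, agent) are illustrative placeholders
observation_space = spaces.Box(low=0, high=3, shape=(10, 10), dtype=np.float32)
```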

[Image: an example of what the world could look like]

The PPO model (from stable_baselines3) is unable to learn how to navigate randomly generated worlds even after 5 million time steps of training (each environment reset creates a new random world layout). TensorBoard shows only a very slight increase in average reward after all that training.

I am able to train the model effectively only if I keep the world layout the same on every reset (so no random placement of the agent, etc.).

So my question is: should PPO in theory be able to deal with random world generation like that, or am I trying to make it do something that is beyond its capabilities?

More details: I'm using all default PPO parameters (with `MlpPolicy`).
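The training call is essentially the stock stable_baselines3 usage (`env` stands in for my custom grid-world environment):

```
from stable_baselines3 import PPO

# env is a placeholder for the custom 10x10 grid-world environment described above
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=5_000_000)
```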

The reward system is as follows:

  • On every step the reward is -0.5 * the distance between the agent (smiley face) and the target ('$')
  • If the agent is next to a fire ('X'), it gets -100 reward
  • If the agent is next to the target ('$'), it gets a reward of 1000 and the episode ends

Max of 200 steps per episode.
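Roughly, the per-step reward logic looks like this (a simplified sketch; the helper names here are not the actual ones from my code):

```
import numpy as np

def is_adjacent(a, b):
    """True if two grid cells touch (Chebyshev distance of 1)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1])) == 1

def compute_reward(agent_pos, target_pos, fire_positions):
    """Return (reward, done) following the scheme described above."""
    if is_adjacent(agent_pos, target_pos):
        return 1000.0, True                       # reached the target: episode ends
    if any(is_adjacent(agent_pos, f) for f in fire_positions):
        return -100.0, False                      # standing next to a fire
    dist = np.linalg.norm(np.asarray(agent_pos) - np.asarray(target_pos))
    return -0.5 * dist, False                     # per-step distance shaping
```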

Pete
  • I tried switching to a 36x36 grid so that I could use the `CnnPolicy` with PPO, but again, after 4 hours of training and 5 million time steps, the model didn't seem to learn much. It is as if it were not able to see the target on the map / image. Like before, when I keep the map / image consistent (so the world does not generate randomly with every episode), the model learns very fast. But that is obviously a case of over-fitting. How is this thing (PPO) ever able to learn more complex environments where things move (various Atari games, etc.)? I am just surprised. Should I not be? – Pete Nov 24 '22 at 19:43

1 Answer


I would rather try good old off-policy deterministic solutions like DQN for this task, but on-policy stochastic PPO should be able to solve it as well. I recommend changing three things in your design; they may help your training.

First, your reward signal design probably "embarrasses" your network: you have an enormously large positive terminal reward while trying to push your agent toward that terminal state as quickly as possible with small punishments. I would definitely suggest reward normalization for PPO.
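A minimal sketch of what I mean, using SB3's `VecNormalize` wrapper (`GridWorldEnv` is a placeholder for your environment class):

```
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# GridWorldEnv is a placeholder for your custom environment class
venv = DummyVecEnv([lambda: GridWorldEnv()])
venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_reward=10.0)

model = PPO("MlpPolicy", venv, verbose=1)
model.learn(total_timesteps=5_000_000)
```

Alternatively, you could simply rescale the raw rewards yourself (e.g. 1000 down to 10 and -100 down to -1) so the terminal bonus does not dominate the value targets.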

Second, if you don't fine-tune your PPO hyperparameters, the entropy coefficient `ent_coef` stays at its default of 0.0; however, the entropy term of the loss function could be very useful in your environment. I would try at least 0.01.
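For example (only the entropy coefficient is changed from its default; `env` is again your environment):

```
from stable_baselines3 import PPO

# env is your custom grid-world environment; everything else stays at the defaults
model = PPO("MlpPolicy", env, ent_coef=0.01, verbose=1)
```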

Third, PPO would really be enhanced in your case (in my opinion) if you changed the MLP policy to a recurrent one. Take a look at `RecurrentPPO` from the sb3-contrib package.
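Something like this (a minimal sketch; `env` is again a placeholder for your environment):

```
from sb3_contrib import RecurrentPPO

# MlpLstmPolicy adds an LSTM on top of the MLP feature extractor
model = RecurrentPPO("MlpLstmPolicy", env, verbose=1)
model.learn(total_timesteps=5_000_000)
```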

gehirndienst