I'm new to RL and I was hoping to get some advice from you all:
I created a custom environment: a 10x10 grid world where the agent, its target destination, and some obstacles (fires) can be randomly placed. The observation the model is trained on is just a Box numpy array representing the grid positions (0 for empty spaces, 1 for the target, etc.).
[Image: an example of what a generated world could look like]
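Roughly, the observation is built like this (the integer codes for the agent and fires below are just placeholders; only 0 for empty and 1 for the target are fixed as described above):

```python
import numpy as np
from gymnasium import spaces

# Placeholder encoding: only 0 (empty) and 1 (target) are the actual codes;
# 2 (agent) and 3 (fire) are stand-ins for illustration.
EMPTY, TARGET, AGENT, FIRE = 0, 1, 2, 3

# Observation space: the full 10x10 integer grid as a Box.
observation_space = spaces.Box(low=0, high=3, shape=(10, 10), dtype=np.int64)

# Example observation: agent at (0, 0), target at (7, 7), one fire at (3, 4).
grid = np.zeros((10, 10), dtype=np.int64)
grid[0, 0] = AGENT
grid[7, 7] = TARGET
grid[3, 4] = FIRE
```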
The PPO model (from stable_baselines3) is unable to learn how to navigate randomly generated worlds, even after 5 million timesteps of training (each environment reset creates a new random world layout). TensorBoard shows only a very slight increase in average reward after all that training.
I am able to train the model effectively only if I keep the world layout the same on every reset (i.e. no random placement of the agent, etc.).
So my question is: should PPO, in theory, be able to handle random world generation like this, or am I trying to make it do something beyond its capabilities?
More details: I'm using all default PPO parameters (with MlpPolicy).
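The training setup is basically this (the env class name here is just a placeholder for my custom env):

```python
from stable_baselines3 import PPO

# GridWorldEnv is a stand-in name for the custom 10x10 grid environment described above.
env = GridWorldEnv()

# Default PPO hyperparameters with MlpPolicy, trained for 5M timesteps.
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./ppo_gridworld/")
model.learn(total_timesteps=5_000_000)
```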
The reward system is as follows (a rough code sketch follows after the list):
- On every step the reward is -0.5 * the distance between the agent (the smiley face) and the target ('$')
- If the agent is next to a fire ('X'), it gets a reward of -100
- If the agent is next to the target ('$'), it gets a reward of 1000 and the episode ends
Max of 200 steps per episode.
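In code, the step/reward logic is roughly the following (the helper names, the distance metric, and the adjacency check shown here are simplifications, not the exact implementation):

```python
import numpy as np

def step_outcome(agent_pos, target_pos, fire_positions, step_count, max_steps=200):
    """Sketch of the reward logic above; Euclidean distance and Chebyshev adjacency assumed."""
    agent = np.asarray(agent_pos)
    target = np.asarray(target_pos)

    # Per-step shaping: -0.5 * distance between the agent and the target.
    reward = -0.5 * np.linalg.norm(agent - target)
    terminated = False

    # Next to a fire ('X'): -100 penalty (added to the shaping term in this sketch).
    if any(np.abs(agent - np.asarray(f)).max() <= 1 for f in fire_positions):
        reward += -100.0

    # Next to the target ('$'): +1000 and the episode ends.
    if np.abs(agent - target).max() <= 1:
        reward += 1000.0
        terminated = True

    # Hard cap of 200 steps per episode.
    truncated = step_count >= max_steps
    return reward, terminated, truncated
```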