
I have the following reinforcement learning problem (simplified) with continuous actions and state variables (the configuration is shown in an image).

I have created a custom environment with gymnasium (the successor to OpenAI Gym). Every time I reset my env, between 2 and 5 balls spawn at random positions in a 100x100 box. One of those balls (the red one) receives an action (a direction of movement) and moves according to some physics. This ball always spawns.

Note that the size of the observation only changes when we reset (it remains the same when we call the step function).

Action space:

1) The angle (a float) in which we want to move the red ball (theta in the image).

Observation space:

1) the coordinates of the red ball (np array of floats)
2) the coordinates of the blue ball 1 (np array of floats)
3) the coordinates of the blue ball 2 (np array of floats)
4) the coordinates of the blue ball 3 (np array of floats)
5) the coordinates of the blue ball 4 (np array of floats)

The problem is that sometimes there is no blue ball 2, blue ball 3, or blue ball 4 (because, for example, only two balls spawned on reset).

Therefore, the size of the observation can change every time the reset function of my env is called. This causes problems when I use algorithms like A2C or PPO from stable-baselines3, because the reset method is called during the learning phase.
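To make the issue concrete, here is a simplified sketch of what my reset roughly returns (the names are illustrative; the physics and the rest of the env are omitted):

```python
import numpy as np

# Simplified sketch of how the observation is currently built on reset
# (illustrative only; the real env also handles physics, rendering, etc.)
def make_reset_obs():
    n_balls = np.random.randint(2, 6)           # 2 to 5 balls in total
    red = np.random.uniform(0, 100, size=2)     # the red ball always spawns
    blues = [np.random.uniform(0, 100, size=2)  # 1 to 4 blue balls
             for _ in range(n_balls - 1)]
    # Concatenating gives an observation whose length depends on n_balls:
    # 4 floats with 1 blue ball, up to 10 floats with 4 blue balls.
    return np.concatenate([red, *blues]).astype(np.float32)
```

So the observation has between 4 and 10 floats depending on the reset, while stable-baselines3 expects every observation to match a fixed observation_space.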

How can I solve this issue? I read about padding with zeros, but I am not sure that adding a [0, 0] (which is a valid position) for every blue ball that has not spawned will help, and I was not able to understand the other solutions I have read about.
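This is roughly how I understand the zero-padding idea, with an extra presence flag per blue-ball slot (the flag is just my guess at what the other solutions were suggesting; I have not verified that this actually works):

```python
import numpy as np
from gymnasium import spaces

MAX_BLUE = 4  # always reserve slots for 4 blue balls

# Fixed-size observation: red (x, y) + 4 blue slots of (x, y, present) each.
observation_space = spaces.Box(low=0.0, high=100.0,
                               shape=(2 + MAX_BLUE * 3,), dtype=np.float32)

def build_obs(red, blues):
    """Pad missing blue balls with zeros and mark each slot with a presence flag."""
    slots = []
    for i in range(MAX_BLUE):
        if i < len(blues):
            slots.append([blues[i][0], blues[i][1], 1.0])  # real ball
        else:
            slots.append([0.0, 0.0, 0.0])                  # empty (padded) slot
    return np.concatenate([red, np.ravel(slots)]).astype(np.float32)
```

With this the observation always has 2 + 4*3 = 14 floats, and the presence flag is supposed to let the agent tell a real ball at [0, 0] apart from an empty slot. I do not know if this is the right direction, though.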

Any advice on how to handle this particular case is welcome.
