
I would like to train a Stable Baselines model on a custom gym environment. The training loop looks like this:

    obs = env.reset()
    for i in range(1000):
        # ask the trained model for the next action
        action, _states = model.predict(obs, deterministic=True)
        print(f"action: {action}")
        obs, reward, done, info = env.step(action)
        env.render()
        if done:
            # start a new episode when the current one ends
            obs = env.reset()

There are basic examples like this, e.g. here: https://stable-baselines.readthedocs.io/en/master/guide/examples.html

Somewhere else (within the environment class) I defined an action_space:

    self.action_space = spaces.Discrete(5)    

With this basic definition of the action_space, the actions returned by model.predict for each step are simply integers from 0 to 4.
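For reference, a minimal sketch of such an environment (MazeEnv and the observation space are placeholders, only to illustrate where the action_space is defined):

    import gym
    import numpy as np
    from gym import spaces

    class MazeEnv(gym.Env):
        # hypothetical minimal environment, just to illustrate the setup
        def __init__(self):
            super().__init__()
            self.action_space = spaces.Discrete(5)
            # dummy observation space; the real one describes the maze state
            self.observation_space = spaces.Box(low=0, high=1, shape=(4,), dtype=np.float32)

        def reset(self):
            return self.observation_space.sample()

        def step(self, action):
            obs = self.observation_space.sample()
            reward, done, info = 0.0, False, {}
            return obs, reward, done, info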

Now, to make the question a little more practical, assume my environment describes a maze. My overall available actions in this case could be:

    realActions = [_UP, _DOWN, _LEFT, _RIGHT]

In a maze, however, the available actions change from step to step. For example, at the upper wall of the maze the actions would only be:

    realActions = [_DOWN, _LEFT, _RIGHT]

So I would try to take this into consideration:

        env.render()
        realActions = env.getCurrentAvailableActions()
        # set the gym action_space with a reduced number of options:
        self.action_space = spaces.Discrete(len(realActions))

In env.step I would then execute realActions[action] in the maze to perform the correct move.
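To make the intended mapping concrete, step() would look roughly like this (getCurrentAvailableActions and _applyMove are placeholders for my maze logic, not real library calls):

    def step(self, action):
        # 'action' is an index into the currently available actions,
        # e.g. realActions = [_DOWN, _LEFT, _RIGHT] at the upper wall
        realActions = self.getCurrentAvailableActions()
        move = realActions[action]
        # _applyMove is a placeholder for the actual maze update
        obs, reward, done, info = self._applyMove(move)
        return obs, reward, done, info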

Unfortunately, the reassignment of self.action_space does not seem to be recognized by my model.

There is another important point: the workaround of mapping the model's action to realActions, instead of defining the action_space itself with these values, could never train correctly, because the model would never know what effect the action it generates has on the maze; it does not see the mapping from its own action to realActions.

So my question is: do Stable Baselines / gym provide a practical way to limit the action space to the actions that are dynamically available at each step?

Thank you!

  • Usually it is assumed that your action space is fixed during training. However, there is a known technique called "action masking". If your model is PPO, I would advise you to look at [MaskablePPO](https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html) in sb3-contrib, which implements exactly what you want (see the sketch after these comments). If not, I know two general solutions: a) you shape your reward function so that it backpropagates a bad reward each time you select an invalid action, or b) you disallow applying an invalid action, either by replacing it with a random valid action or by doing nothing. – gehirndienst Dec 06 '22 at 11:49
  • Would it be good to go on with env.step(possibleAction) if the predicted action is not possible in the environment? Maybe one could choose possibleAction = random.choice(realActions)? – Mike75 Dec 08 '22 at 09:27
  • And then go on with: obs, reward, done, info = env.step(possibleAction) – Mike75 Dec 08 '22 at 10:00
  • It is a legit solution, especially if you don't have a "do nothing" option, e.g. in a grid env with `{up,down,left,right}` action space, and if you don't want to change the model code (its action sampling or how it gets an action distribution, depending on the type of chosen algorithm). Like I said above, action masking works out of the box only with PPO in sb3. – gehirndienst Dec 08 '22 at 16:04
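Following the MaskablePPO suggestion above, here is a rough sketch of how action masking could look for the maze setup. It assumes the environment keeps a fixed action space over the four moves and exposes the getCurrentAvailableActions() helper from the question; MazeEnv, ALL_ACTIONS and mask_fn are illustrative names, not part of the sb3-contrib API:

    import numpy as np
    from sb3_contrib import MaskablePPO
    from sb3_contrib.common.wrappers import ActionMasker
    from sb3_contrib.common.maskable.utils import get_action_masks

    _UP, _DOWN, _LEFT, _RIGHT = 0, 1, 2, 3
    ALL_ACTIONS = [_UP, _DOWN, _LEFT, _RIGHT]

    def mask_fn(env):
        # boolean mask over the FIXED action space: True = currently allowed
        available = env.getCurrentAvailableActions()   # e.g. [_DOWN, _LEFT, _RIGHT]
        return np.array([a in available for a in ALL_ACTIONS])

    env = MazeEnv()                   # action_space stays a fixed spaces.Discrete(4)
    env = ActionMasker(env, mask_fn)  # wrapper exposes the mask to MaskablePPO

    model = MaskablePPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=10000)

    obs = env.reset()
    for i in range(1000):
        action_masks = get_action_masks(env)
        action, _states = model.predict(obs, action_masks=action_masks, deterministic=True)
        obs, reward, done, info = env.step(action)
        env.render()
        if done:
            obs = env.reset()

The key point is that the action space itself never changes; the mask only tells the policy which of the fixed actions are valid in the current step, so invalid actions are never sampled and the model always sees a consistent mapping from action index to move.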

0 Answers