A parametric/variable-length action model is provided in the RLlib examples. The example assumes the outputs are logits for a single Categorical action distribution. How can I get this to work with a more complex output?
For example, there are 200 different balls in a box. At every step, 2 balls are picked and put back. The action space can be defined as MultiDiscrete([200, 200]) or Tuple((spaces.Discrete(200), spaces.Discrete(200))).
There are 3 restrictions that make some actions invalid (actions are 0-indexed, so ball No.2 is index 1):
- The 2 balls picked at each step must be different, so actions like (1, 1) or (2, 2) are invalid.
- Balls with the same color are not allowed to be picked together. For example, if the No.2 and No.3 balls are both yellow in some state, they cannot be picked together, so action (1, 2) is invalid in that state.
- Some balls are not allowed to be picked in specific states. For example, when the No.2 ball is marked Not Allowed to Pick, all actions that include the No.2 ball, such as (1, n) or (n, 1), are invalid.
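To make the three constraints concrete, here is a minimal NumPy sketch of the pairwise validity matrix I have in mind. The function name `pair_mask` and the arrays `colors` and `allowed` are my own placeholders, not part of RLlib, and the example uses 4 balls instead of 200 to keep it readable.

```python
import numpy as np


def pair_mask(colors, allowed):
    """Return an (n, n) boolean matrix; entry [i, j] is True if picking
    ball i and ball j together is a valid action (hypothetical helper)."""
    colors = np.asarray(colors)
    allowed = np.asarray(allowed, dtype=bool)
    n = colors.shape[0]
    different_ball = ~np.eye(n, dtype=bool)               # constraint 1
    different_color = colors[:, None] != colors[None, :]  # constraint 2
    both_allowed = allowed[:, None] & allowed[None, :]    # constraint 3
    return different_ball & different_color & both_allowed


# Toy state: balls 1 and 2 (0-indexed) share a color, ball 3 is not allowed.
colors = [0, 1, 1, 2]
allowed = [1, 1, 1, 0]
mask = pair_mask(colors, allowed)
```

Flattening this matrix would give a mask for a single Discrete(n * n) action, which is 40,000 entries for 200 balls. I am not sure whether that, or a per-component mask for the MultiDiscrete/Tuple space, is the better fit.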
How can I enforce these 3 constraints via action masking in RLlib?
I assume the observation space has 2 parts. The first constraint is implicit: the invalid actions can be determined without any observation. For the second constraint, a real_obs marks each ball with a number indicating its color; balls with the same number are not allowed to be picked together. For the third constraint, an action_mask indicates whether each ball is allowed to be picked.
Specifically, how should I implement the action/observation space and the forward function in the custom model?
If my assumed observation space is not feasible, feel free to define your own observation space and the corresponding custom model.
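For reference, this is roughly the masking step I imagine inside the forward function, written in plain NumPy rather than as an actual RLlib model. It assumes the two picks are sampled one after the other, so the mask for the second ball depends on the first. The names `mask_logits` and `second_pick_mask` are hypothetical, and replacing invalid logits with a very large negative number is my understanding of the usual action-masking trick, not something I have confirmed for this case.

```python
import numpy as np

FLOAT_MIN = np.finfo(np.float32).min  # stands in for "minus infinity"


def mask_logits(logits, valid):
    """Replace the logits of invalid entries with a huge negative value."""
    return np.where(valid, logits, FLOAT_MIN)


def second_pick_mask(first, colors, allowed):
    """Valid choices for the second ball, given the first ball's index."""
    colors = np.asarray(colors)
    allowed = np.asarray(allowed, dtype=bool)
    valid = allowed & (colors != colors[first])  # constraints 3 and 2
    valid[first] = False                         # constraint 1
    return valid


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


colors = [0, 1, 1, 2]    # same toy state: balls 1 and 2 share a color
allowed = [1, 1, 1, 0]   # ball 3 is not allowed
valid = second_pick_mask(first=1, colors=colors, allowed=allowed)
logits = np.zeros(4, dtype=np.float32)  # placeholder network output
probs = softmax(mask_logits(logits, valid))
```

In this sketch the mask for the first ball would just be action_mask itself, although a stricter version would also need to exclude balls that have no valid partner left.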