A bug is encountered with the "Maskable PPO" training with custom Env setup

Question

I encountered an error while using the action-masking algorithm Maskable PPO from SB3-contrib.

File ~\anaconda3\lib\site-packages\sb3_contrib\common\maskable\distributions.py:231, in MaskableMultiCategoricalDistribution.apply_masking(self, masks)
    228 masks = th.as_tensor(masks)
    230 # Restructure shape to align with logits
--> 231 masks = masks.view(-1, sum(self.action_dims))
    233 # Then split columnwise for each discrete action
    234 split_masks = th.split(masks, tuple(self.action_dims), dim=1)

RuntimeError: shape '[-1, 1600]' is invalid for input of size 800

I am running a learning program whose action is a MultiBinary space with 800 selections, each either 0 or 1.

The action space is defined as below:

self.action_space = spaces.MultiBinary(800)

Within the custom environment class, I created an "action_mask" function that returns a list of 800 boolean values.

When I follow the documentation and start training the model, the error above is raised. This is the training script:

import os
import time

from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.utils import get_action_masks
from Equities_RL_Env import Equities_RL_Env

models_dir = "models/V1 31-Jul/"
logdir = f"logs/{time.strftime('%d %b %Y %H-%M', time.localtime())}/"

# Create the output directories if they do not exist yet
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logdir, exist_ok=True)

# Normalize_frame, historical_frame and pf are not shown here
env = Equities_RL_Env(Normalize_frame(historical_frame), pf)
env.reset()

model = MaskablePPO('MlpPolicy', env, verbose=1, tensorboard_log=logdir)

TIMESTEPS = 1000
iters = 0
while iters <= 1000000:
    iters += 1
    # Keep training without resetting the timestep counter between calls
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(os.path.join(models_dir, str(TIMESTEPS * iters)))

Is there a way to define the expected mask shape within the custom environment?
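
For reference, the traceback reshapes the mask to sum(self.action_dims) = 1600 columns, which is exactly twice the 800 values my function returns. That suggests each binary action is treated as a two-way choice that needs two mask entries. Below is a minimal sketch of that idea in plain Python, not a confirmed fix. The helper name expand_binary_mask, the sample flags, and the assumed ordering (entry 2*i for choosing 0, entry 2*i + 1 for choosing 1) are illustrative assumptions.

```python
from typing import List, Sequence


def expand_binary_mask(allow_one: Sequence[bool]) -> List[bool]:
    """Expand one flag per binary action into two flags per action.

    Assumed layout (not confirmed by the traceback): for action i,
    position 2*i says whether choosing 0 is allowed and position
    2*i + 1 says whether choosing 1 is allowed.
    """
    expanded: List[bool] = []
    for can_select in allow_one:
        # Assumption: choosing 0 for an action is always allowed.
        expanded.append(True)
        expanded.append(bool(can_select))
    return expanded


# Hypothetical flags standing in for the 800 values my environment returns.
per_action = [i % 2 == 0 for i in range(800)]
mask = expand_binary_mask(per_action)
print(len(mask))  # 1600, the size requested in the error message
```

If this reading is right, the environment's mask function would return the expanded 1600-entry list instead of the 800-entry one. Note also that get_action_masks in sb3_contrib looks for an environment method named action_masks, so the method name is worth double-checking too.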
