I'm using RLlib for the first time and trying to train a couple of PPO agents on a custom multi-agent RL environment. The implementation hiccup I need to figure out is how to alter training for one special agent so that it only takes an action every X timesteps. Is it best to only call compute_action() every X timesteps? Or, on the other steps, to mask the policy selection so it has to re-sample until a No-Op is chosen? Or to modify the action that gets fed into the environment (and the previous actions in the training batches) to be No-Ops?
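To make that last option concrete, here's roughly what I have in mind: a thin wrapper around my env that forces the No-Op on the off-steps. This is just a sketch; names like `slow_agent`, `NO_OP_ACTION`, `ACT_EVERY`, and `MyBaseMultiAgentEnv` are placeholders for my actual setup.

```python
# Sketch of the "force a No-Op in the env" idea (option 3).
# "slow_agent", NO_OP_ACTION, ACT_EVERY, and MyBaseMultiAgentEnv are placeholders.
from ray.rllib.env.multi_agent_env import MultiAgentEnv

NO_OP_ACTION = 0   # assumed index of the No-Op in the special agent's discrete action space
ACT_EVERY = 5      # X: the special agent is only allowed to act every 5 timesteps


class SlowAgentEnv(MultiAgentEnv):
    def __init__(self, config=None):
        super().__init__()
        self.base_env = MyBaseMultiAgentEnv(config)  # my existing multi-agent env
        self.t = 0

    def reset(self):
        self.t = 0
        return self.base_env.reset()

    def step(self, action_dict):
        self.t += 1
        # On "off" steps, ignore whatever the policy sampled for the special
        # agent and execute a No-Op instead.
        if self.t % ACT_EVERY != 0 and "slow_agent" in action_dict:
            action_dict = dict(action_dict, slow_agent=NO_OP_ACTION)
        return self.base_env.step(action_dict)
```

My concern with this is that it only changes what the environment executes; the policy would presumably still train on the action it actually sampled, which is why I mentioned also rewriting the previous actions in the training batches.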
What's the easiest way to implement this that still takes advantage of RLlib's training features? Do I need to create a custom training loop for this, or is there a way to configure PPOTrainer to do it?
Thanks