
I'm using RLlib for the first time to train a couple of PPO agents on a custom multi-agent RL environment. The implementation hiccup I need to figure out is how to alter training for one special agent so that it only takes an action every X timesteps. Is it best to only call compute_action() every X timesteps? Or, on the other steps, to mask the policy selection so it has to re-sample until a no-op is chosen? Or to modify the action that gets fed into the environment (and the previous actions in the training batches) to be no-ops?

What's the easiest way to implement this that still takes advantage of rllib's training features? Do I need to create a custom training loop for this, or is there a way to configure PPOTrainer to do this?

Thanks

sh0831

1 Answer


Let t := the number of timesteps so far. Give the special agent t mod X as an extra observation feature, and don't process its actions in the environment when t mod X != 0 (see the sketch after the list). This accomplishes two things:

  1. the agent, in effect, only takes an action every X timesteps, because you are ignoring all the others
  2. the agent can learn that only the actions taken every X timesteps affect future rewards
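
A minimal sketch of how this can look inside a custom MultiAgentEnv, written against the newer gymnasium-style reset/step signatures. The agent ids "special" and "fast_0", the interval X = 5, and the helper _apply() are illustrative assumptions, and observation/action space definitions are omitted; adapt to your actual environment and Ray version.

```python
import numpy as np
from ray.rllib.env.multi_agent_env import MultiAgentEnv

X = 5  # hypothetical interval: the special agent acts every X timesteps

class SlowAgentEnv(MultiAgentEnv):
    """Illustrative env: 'special' only has its action applied when t % X == 0."""

    def __init__(self, config=None):
        super().__init__()
        self._agent_ids = {"special", "fast_0"}
        self.t = 0

    def _obs(self):
        base = np.zeros(3, dtype=np.float32)  # stand-in for the real state features
        return {
            # the special agent also sees t mod X, so it can learn which steps count
            "special": np.append(base, self.t % X),
            "fast_0": base,
        }

    def reset(self, *, seed=None, options=None):
        self.t = 0
        return self._obs(), {}

    def step(self, action_dict):
        # Apply the special agent's action only when t mod X == 0;
        # on every other step it is silently ignored (a forced no-op).
        if self.t % X == 0:
            self._apply("special", action_dict["special"])
        self._apply("fast_0", action_dict["fast_0"])

        self.t += 1
        rewards = {"special": 0.0, "fast_0": 0.0}   # fill in real rewards
        terminateds = {"__all__": self.t >= 100}
        truncateds = {"__all__": False}
        return self._obs(), rewards, terminateds, truncateds, {}

    def _apply(self, agent_id, action):
        pass  # real environment dynamics go here
```

Because the no-op behaviour lives entirely inside the environment, you can keep the standard PPO training setup; the special agent's policy simply learns from the t mod X feature that its off-cycle actions have no effect.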