I'm seeing some weird behavior in Ray Tune that I can't make sense of.
What I'm trying to do:
- I have set up a custom RLlib multi-agent env with two agents
- The two agents have different observation and action spaces
- Both should be trained with PPO, but each with its own config, since I'm also using a custom PyTorch model for each of them (registered roughly as sketched below)
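For context, the spaces, models, and env referenced in the config are set up roughly like this. This is only a simplified sketch: the shapes are placeholders, and MyModelA, AttackerModel, and ConcurrentEnv stand in for my actual classes:

import gymnasium as gym
import numpy as np
from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env

# Placeholder spaces; the real ones differ per agent and have other shapes.
obs_space_a = gym.spaces.Box(-np.inf, np.inf, shape=(10,), dtype=np.float32)
act_space_a = gym.spaces.Discrete(4)
obs_space_b = gym.spaces.Box(-np.inf, np.inf, shape=(20,), dtype=np.float32)
act_space_b = gym.spaces.Discrete(6)

# Custom TorchModelV2 subclasses, registered under the names used in the
# policy specs below.
ModelCatalog.register_custom_model("model_a", MyModelA)
ModelCatalog.register_custom_model("attacker_model", AttackerModel)

# Custom multi-agent env, registered under the name used in .environment(...).
register_env("concurrent_env", lambda env_config: ConcurrentEnv(env_config))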
You can see my configuration code below:
import ray
from ray import air, tune
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig

# AttackerRewardCallback and the custom env/models/spaces are defined and
# registered elsewhere (see the sketch above).

if __name__ == "__main__":
ray.init()
policies = {
"pol_1": (
None,
obs_space_a,
act_space_a,
{
"model": {
"custom_model": "model_a",
"custom_model_config": {
"hidden_layer_size": 64,
"num_hidden_layers": 2,
"activation": "leaky_relu",
},
},
},
),
"pol_2": (
None,
obs_space_b,
act_space_b,
{
"model": {
"custom_model": "attacker_model",
"custom_model_config": {
"hidden_layer_size": tune.choice([64, 128, 256, 512, 1024]),
"num_hidden_layers": 2,
},
},
},
),
}
config = (
AlgorithmConfig()
.environment("concurrent_env", env_config={"num_agents": 2})
.training(train_batch_size=1024, lr=1e-3, gamma=0.99)
.framework("torch")
.rollouts(num_rollout_workers=1, rollout_fragment_length="auto")
.multi_agent(
policies=policies,
policy_mapping_fn=(
lambda agent_id, episode, worker, **kw: f"pol_{agent_id}"
),
policies_to_train=["pol_1", "pol_2"],
)
)
results = tune.Tuner(
"PPO",
param_space=config.to_dict(),
run_config=air.RunConfig(
stop={"training_iteration": 10000},
callbacks=[AttackerRewardCallback()],
#verbose=1,
),
tune_config=tune.TuneConfig(num_samples=1),
).fit()
As you can see, I'd like to perform hyperparameter optimization for my second agent (pol_2).
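For what it's worth, I also dump the param_space that actually goes to Tune, just to see where (and how many times) the tune.choice object ends up. This is only a debug print, and the exact key layout of the dict may differ between Ray versions:

from pprint import pprint

# Peek at the per-policy specs inside the dict handed to tune.Tuner().
param_space = config.to_dict()
pprint(param_space["multiagent"]["policies"])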
The weird thing is: hidden_layer_size seems to get sampled twice, as you can see in the picture.
Of course only one of the two values can actually be used, so there seems to be something wrong with my config.
I would expect .../custom_model_config/hidden_layer_size to show up only once.
If I run the experiment and print the chosen hidden_layer_size every time the forward() function is called, it looks like only the first sampled value is used anyway.
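That check is just a print inside the custom model's forward(). A stripped-down version of the attacker model looks roughly like this (the real one also builds its layers from num_hidden_layers and the activation setting):

import torch.nn as nn
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class AttackerModel(TorchModelV2, nn.Module):
    """Simplified stand-in for my actual custom model."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name, **kwargs):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
        nn.Module.__init__(self)
        # The hidden_layer_size that Tune resolved for this trial.
        self.hidden_layer_size = model_config["custom_model_config"]["hidden_layer_size"]
        self.net = nn.Sequential(
            nn.Linear(obs_space.shape[0], self.hidden_layer_size),
            nn.ReLU(),
            nn.Linear(self.hidden_layer_size, num_outputs),
        )
        self.value_branch = nn.Linear(obs_space.shape[0], 1)
        self._last_obs = None

    def forward(self, input_dict, state, seq_lens):
        # Debug: which hidden_layer_size was this policy actually built with?
        print("attacker_model hidden_layer_size =", self.hidden_layer_size)
        self._last_obs = input_dict["obs_flat"]
        return self.net(self._last_obs), state

    def value_function(self):
        return self.value_branch(self._last_obs).squeeze(1)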