I have a continuous control problem that I am trying to solve with multi-agent deep deterministic policy gradient (MADDPG). My environment has a 7-dimensional state and 3 continuous actions: two of the actions lie in [0, 1] and one lies in [1, 100]. I use a sigmoid activation on the last layer of the actor network. The algorithm does not seem to learn anything: it only returns boundary actions, e.g. [1, 100, 0] or [0, 1, 1], and the rewards do not improve. I use Ornstein-Uhlenbeck noise for exploration.
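To make the action setup concrete, here is a minimal sketch of how I map the actor's sigmoid output onto the different action ranges (the helper names and the standalone `sigmoid` are simplifications of my actual network code):

```python
import numpy as np

# Per-dimension action bounds: two actions in [0, 1], one in [1, 100].
ACTION_LOW = np.array([0.0, 0.0, 1.0])
ACTION_HIGH = np.array([1.0, 1.0, 100.0])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scale_action(last_layer_preactivation):
    """Squash the actor's final pre-activation through a sigmoid,
    then rescale each dimension to its own [low, high] range."""
    unit = sigmoid(last_layer_preactivation)   # each component in (0, 1)
    return ACTION_LOW + unit * (ACTION_HIGH - ACTION_LOW)

a = scale_action(np.array([0.3, -1.2, 2.0]))
```

The saturation I observe corresponds to the sigmoid being driven to its flat regions, so the scaled actions stick to the boundaries.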
What I have tried to do:
- I have experimented extensively with the hyperparameters.
- I have clipped the gradients.
- I have used prioritized experience replay.
- I have target networks for both actor and critic.
but the problem persists.
Any reply or reference that could help would be appreciated.
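For reference, my exploration noise follows the standard Ornstein-Uhlenbeck process; a sketch of what I use is below (the parameter values here are illustrative, not the ones from my runs):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration
    noise that mean-reverts toward mu at rate theta."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Restart the process at the mean at each new episode.
        self.state = self.mu.copy()

    def sample(self):
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * self.rng.standard_normal(len(self.state)))
        self.state = self.state + dx
        return self.state

noise = OUNoise(3)
n = noise.sample()
```

The noise is added to the actor's output before the action is clipped to its valid range.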