In the PPO algorithm, a probability ratio has to be computed as `ratios = torch.exp(new_probs - old_probs)`, i.e. the probability of the action under the current policy divided by its probability under the previous policy. In my implementation, however, the ratio always equals 1 and never changes. At the same time, the actor loss and the critic loss are decreasing, but the average episode reward fluctuates with no upward trend. Is this related to the ratio being stuck at 1?
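
To make that line concrete, this is all it computes (toy log-probability values for illustration; in a real agent they come from the policy's action distribution):

```python
import torch

# exp(log p_new - log p_old) = p_new / p_old, the PPO probability ratio.
old_probs = torch.log(torch.tensor([0.25, 0.50]))  # log-probs stored at collection time
new_probs = torch.log(torch.tensor([0.30, 0.45]))  # log-probs under the current policy

ratios = torch.exp(new_probs - old_probs)
print(ratios)  # tensor([1.2000, 0.9000])
```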

I don't know where the problem is. Has anyone run into this before? Can you give me some suggestions? Thanks a lot!


1 Answer


Your policy network is updated multiple times with the same batch of data: `old_probs` stays fixed while `new_probs` changes with each update. So yes, the ratio is 1 for the first update after you finish collecting new data, but it changes after the first update, and it is clipped between `1 - epsilon` and `1 + epsilon` in the surrogate loss.
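
A minimal sketch of that update pattern, assuming a PyTorch setup where `policy(states)` returns a `torch.distributions` object (the names `policy`, `optimizer`, `states`, `actions`, and `advantages` are placeholders, not your actual code):

```python
import torch

def ppo_policy_update(policy, optimizer, states, actions, advantages,
                      epochs=4, eps=0.2):
    # Log-probs under the policy that collected the data: computed ONCE,
    # outside the loop, and detached so they stay fixed for every epoch.
    with torch.no_grad():
        old_log_probs = policy(states).log_prob(actions)

    for _ in range(epochs):
        # Recomputed every epoch with the current parameters, so after the
        # first optimizer step new_log_probs != old_log_probs and the
        # ratio moves away from 1.
        new_log_probs = policy(states).log_prob(actions)
        ratios = torch.exp(new_log_probs - old_log_probs)

        # Clipped surrogate objective: clamp the ratio to
        # [1 - eps, 1 + eps] and take the pessimistic minimum.
        surr1 = ratios * advantages
        surr2 = torch.clamp(ratios, 1.0 - eps, 1.0 + eps) * advantages
        loss = -torch.min(surr1, surr2).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

If `old_log_probs` were instead recomputed inside the loop with the current policy, `new_log_probs - old_log_probs` would be identically zero and the ratio would stay pinned at 1 on every update, which matches the symptom described in the question.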
