In the PPO algorithm, a ratio is calculated as ratios = torch.exp(new_probs - old_probs), where new_probs and old_probs have to be log-probabilities for the formula to hold.
This is the probability of the action under the current policy divided by the probability of the same action under the previous policy.
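
A minimal sketch of this ratio computation, written in plain Python so it runs without PyTorch. It assumes new_probs and old_probs are log-probabilities; the helper name ppo_ratios and all numeric values are made up for illustration and are not taken from any particular implementation.

```python
import math

def ppo_ratios(new_log_probs, old_log_probs):
    """Element-wise exp(new - old), i.e. pi_new(a|s) / pi_old(a|s).

    Mirrors torch.exp(new_probs - old_probs) for lists of floats.
    """
    return [math.exp(n - o) for n, o in zip(new_log_probs, old_log_probs)]

# Hypothetical log-probabilities of three sampled actions under the old policy.
old_log_probs = [math.log(0.5), math.log(0.2), math.log(0.3)]

# Evaluating with the same parameters that collected the data:
# the difference is 0 everywhere, so every ratio is exactly 1.
same = ppo_ratios(old_log_probs, old_log_probs)

# After a (made-up) parameter update the action probabilities have shifted,
# so the ratios move away from 1 wherever the probability changed.
new_log_probs = [math.log(0.6), math.log(0.1), math.log(0.3)]
changed = ppo_ratios(new_log_probs, old_log_probs)

print(same)
print(changed)
```

A ratio of exactly 1 is expected on the first gradient step after collecting data, because the policy has not changed yet. If it stays at 1 over several update epochs on the same batch, one possible cause (an assumption, not a diagnosis) is that the old log-probabilities are recomputed from the current network at every step instead of being stored once at collection time.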
In my implementation, however, the ratio is always equal to 1 and never changes. At the same time, the actor loss and the critic loss are both decreasing, but the average episode reward fluctuates with no upward trend. Is this related to the ratio staying at 1?
I don't know where the problem is. Has anyone seen the same problem before? Any suggestions would be appreciated. Thanks a lot!