Hi StackOverflow Community,
I have a question about policy gradient methods in reinforcement learning.
In policy gradient methods, we increase or decrease the log probability of an action based on the return (i.e., the sum of rewards) from that step onwards. So if the return is high, we increase the log probability of that action, but this is the step I have trouble with.
Let's say the return is made up of three rewards. Although the sum of these three rewards is high, the second reward is really bad.
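To make the three-reward case concrete, here is a minimal sketch of how I understand the return from each step onwards is computed. The reward values and the helper name `rewards_to_go` are made up for illustration, and I am assuming no discounting (`gamma=1.0`) unless stated otherwise.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum over k >= t of gamma**(k - t) * r_k, for every step t."""
    returns = []
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical episode: the total is high, but the second reward is bad.
rewards = [10.0, -5.0, 10.0]
returns = rewards_to_go(rewards)
print(returns)  # [15.0, 5.0, 10.0]
```

With these made-up numbers, the log probability of the first action would be scaled by 15 and that of the second by 5, so both are pushed up even though the second reward on its own is -5.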
How do we deal with this problem? How can we assess each reward separately? Is there an alternative version of policy gradient methods that handles this?