How do we assess each reward in the return in Policy Gradient Methods?

Question

Hi StackOverflow Community,

I have a problem with the policy gradient methods in reinforcement learning.

In policy gradient methods, we increase or decrease the log probability of an action based on the return (i.e. the total reward) from that step onwards. So if the return is high, we increase it, but I have a problem with this step.

Let's say that we have three rewards in our return. Although the sum of these three rewards is high, the second reward is really bad.
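To make the example concrete, here is a minimal sketch of how the return from each step onwards is computed. The reward values, the discount factor of 1.0, and the helper name `rewards_to_go` are assumptions chosen for illustration; they do not come from any particular library.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = []
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical episode: the second reward is bad, but the total is high.
rewards = [5.0, -3.0, 6.0]
returns = rewards_to_go(rewards)

for t, (r, g) in enumerate(zip(rewards, returns)):
    print(f"step {t}: reward={r}, return from this step onwards={g}")
```

In this sketch the first action is weighted by the full sum, while the second and third actions are weighted only by the rewards that follow them, so the bad second reward lowers the weight on the first two actions but not on the third.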

How do we deal with this problem? How do we assess each reward separately? Is there an alternative version of policy gradient methods that does this?

score 0 · Answer 1 · answered Jun 11 '19 at 14:28

This is a multi-objective problem, where the reward is a vector rather than a scalar. By definition, there is no single optimal policy in the classical sense. Instead, there is a set of Pareto-optimal policies, i.e., policies for which you cannot improve one objective (the sum of the first reward, for instance) without losing something on another objective (the sum of the other rewards). There are many ways to approach multi-objective problems, both in optimization (often genetic algorithms) and in RL. Naively, you could scalarize the rewards by linear weighting, but that's really inefficient. More sophisticated approaches learn a manifold in policy parameter space (e.g. this).
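As a rough illustration of the two ideas above, the sketch below filters a set of candidate policies down to the Pareto-optimal ones and then shows how linear scalarization picks one of them depending on the weights. The policy names, their per-objective returns, the weights, and the helper functions are all hypothetical; this is not the manifold-learning approach mentioned above.

```python
def scalarize(reward_vector, weights):
    """Collapse a vector of per-objective returns into one scalar."""
    return sum(w * r for w, r in zip(weights, reward_vector))


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return {
        name: value
        for name, value in candidates.items()
        if not any(dominates(other, value) for other in candidates.values())
    }


# Hypothetical per-objective returns (objective 1, objective 2) of four policies.
candidates = {"A": (10.0, 2.0), "B": (6.0, 6.0), "C": (5.0, 5.0), "D": (2.0, 9.0)}

front = pareto_front(candidates)

# Different weightings select different Pareto-optimal policies.
best_first = max(front, key=lambda n: scalarize(front[n], (0.7, 0.3)))
best_second = max(front, key=lambda n: scalarize(front[n], (0.3, 0.7)))

print("Pareto front:", sorted(front))
print("Best with weights (0.7, 0.3):", best_first)
print("Best with weights (0.3, 0.7):", best_second)
```

Each fixed weighting yields only a single point on the front, so covering the whole set of trade-offs this way means re-solving the problem for many weight vectors.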
