
I understand backpropagation in policy gradient networks, but I am not sure how it works with libraries that auto-differentiate.

That is, how they transform it into a supervised learning problem. For example, the code below:

Y = self.probs + self.learning_rate * np.squeeze(np.vstack([gradients]))

Why is Y not a 1-hot vector for the action taken? He computes the gradient assuming the action taken was correct, i.e. Y is a one-hot vector, and then multiplies it by the reward at the corresponding time-step. But during training he feeds the resulting Y above as the correction (target). I think he should multiply the rewards by the one-hot vector instead. https://github.com/keon/policy-gradient/blob/master/pg.py#L67
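
For reference, here is my reading of how that line builds the target (a minimal sketch with made-up numbers; the (one_hot - probs) * reward form for gradients is my paraphrase of the surrounding code in that file, not a verbatim copy):

import numpy as np

# Hypothetical single time-step with 3 actions (illustrative numbers only).
probs = np.array([0.2, 0.5, 0.3])   # softmax output pi(a|s) from the network
action = 1                          # action that was actually sampled
reward = 2.0                        # discounted return credited to this step
learning_rate = 0.01

# One-hot encoding of the sampled action.
one_hot = np.zeros_like(probs)
one_hot[action] = 1.0

# What the repo appears to store per step, scaled by the reward at train time:
gradients = (one_hot - probs) * reward

# The supervised target fed to the network -- a soft vector, not one-hot.
Y = probs + learning_rate * gradients
print(Y)   # e.g. [0.196, 0.51, 0.294]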

Abhishek Bhatia

1 Answer


Y is not a 1-hot vector because it is the sum of the action probabilities (i.e. self.probs) and the gradients scaled by their corresponding rewards, so it is a soft distribution rather than a hard label.
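
To see why feeding this soft target to a cross-entropy fit still amounts to a policy-gradient step, here is a small numerical sketch (it assumes a softmax output trained with categorical cross-entropy, as in the linked repo; the numbers are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logits, sampled action and reward.
z = np.array([0.1, 0.8, -0.3])
probs = softmax(z)
action, reward, lr = 1, 2.0, 0.01

one_hot = np.zeros_like(probs)
one_hot[action] = 1.0

# The soft target from the code in question.
Y = probs + lr * reward * (one_hot - probs)

# Gradient of the cross-entropy loss -sum(Y * log(probs)) w.r.t. the logits
# is (probs - Y), since Y still sums to 1.
grad_ce = probs - Y

# REINFORCE direction: -reward * d/dz log pi(action | z) = -reward * (one_hot - probs).
grad_pg = -reward * (one_hot - probs)

print(np.allclose(grad_ce, lr * grad_pg))   # True: same direction, scaled by lr

So minimizing cross-entropy against Y pushes the logits in the REINFORCE direction, scaled by the learning rate and the reward, which is why the target does not need to be one-hot.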

mynameisvinn