
I understand backpropagation in policy gradient networks, but I am not sure how it works with libraries that auto-differentiate.

That is, how they transform it into a supervised learning problem. For example, the code below:

Y = self.probs + self.learning_rate * np.squeeze(np.vstack([gradients]))

Why is Y not a 1-hot vector for the action taken? He computes the gradient assuming the action taken was correct, i.e. Y is a one-hot vector, and then multiplies it by the reward at the corresponding time-step. But during training he feeds the resulting Y above as the correction (target). I think he should multiply the rewards by the one-hot vector instead. https://github.com/keon/policy-gradient/blob/master/pg.py#L67
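
For reference, here is my reading of how that line builds the target (a minimal sketch with made-up numbers; the (one_hot - probs) * reward form for gradients is my paraphrase of the surrounding code in that file, not a verbatim copy):

import numpy as np

# Hypothetical single time-step with 3 actions (illustrative numbers only).
probs = np.array([0.2, 0.5, 0.3])   # softmax output pi(a|s) from the network
action = 1                          # action that was actually sampled
reward = 2.0                        # discounted return credited to this step
learning_rate = 0.01

# One-hot encoding of the sampled action.
one_hot = np.zeros_like(probs)
one_hot[action] = 1.0

# What the repo appears to store per step, scaled by the reward at train time:
gradients = (one_hot - probs) * reward

# The supervised target fed to the network -- a soft vector, not one-hot.
Y = probs + learning_rate * gradients
print(Y)   # e.g. [0.196, 0.51, 0.294]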

Abhishek Bhatia

1 Answer


Y is not a 1-hot vector because it is the sum of the action probabilities (i.e. self.probs) and the gradients scaled by their corresponding rewards, so it is a soft distribution rather than a hard label.
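
To see why feeding this soft target to a cross-entropy fit still amounts to a policy-gradient step, here is a small numerical sketch (it assumes a softmax output trained with categorical cross-entropy, as in the linked repo; the numbers are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical logits, sampled action and reward.
z = np.array([0.1, 0.8, -0.3])
probs = softmax(z)
action, reward, lr = 1, 2.0, 0.01

one_hot = np.zeros_like(probs)
one_hot[action] = 1.0

# The soft target from the code in question.
Y = probs + lr * reward * (one_hot - probs)

# Gradient of the cross-entropy loss -sum(Y * log(probs)) w.r.t. the logits
# is (probs - Y), since Y still sums to 1.
grad_ce = probs - Y

# REINFORCE direction: -reward * d/dz log pi(action | z) = -reward * (one_hot - probs).
grad_pg = -reward * (one_hot - probs)

print(np.allclose(grad_ce, lr * grad_pg))   # True: same direction, scaled by lr

So minimizing cross-entropy against Y pushes the logits in the REINFORCE direction, scaled by the learning rate and the reward, which is why the target does not need to be one-hot.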

mynameisvinn