I want to use a policy-gradient method to find the shortest path between nodes in a network.
The network is represented as a graph in which every edge carries a reward of -1.
The shortest path is therefore the path whose (negative) total reward is closest to 0.
Accordingly, I update the policy parameters with gradient descent.
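For example, with hypothetical nodes A, B, C, D (not my actual graph), the return of a path is just -1 times its number of edges:
edge_reward = -1
path = ["A", "B", "C", "D"]                    # 3 edges between hypothetical nodes
path_return = edge_reward * (len(path) - 1)    # -3; a 2-edge path would score -2
# The shortest path is the one whose return is closest to 0.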
Here is the update rule in TensorFlow (1.x):
self.cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.outputTrue, logits=self.outputPred)  # per-sample cross entropy, always >= 0
self.cerd = tf.tensordot(self.cross_entropy, self.reward, axes=1)  # reward-weighted sum over the batch
self.meanCEloss = self.cerd / tf.cast(BS, tf.float32)  # BS is the batch size.
self.train_step = tf.train.AdamOptimizer(1e-4).minimize(self.meanCEloss)
However, after running the code, self.meanCEloss keeps decreasing towards negative infinity until underflow occurs.
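A quick check with made-up numbers seems consistent with this behaviour: the per-sample cross entropy is always non-negative, while every reward is negative, so the weighted sum can only become more negative as the cross entropy grows.
import numpy as np
cross_entropy = np.array([0.5, 2.0, 10.0])   # cross entropy values, always >= 0
reward = np.array([-3.0, -2.0, -4.0])        # path rewards, all negative
loss = np.dot(cross_entropy, reward) / 3.0   # divide by batch size of 3
print(loss)                                  # approx. -15.17, and it keeps falling as the cross entropy rises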
What changes are required in the loss evaluation to solve the problem?