I want to use a policy-gradient method to find the shortest path between nodes in a network.
The network is represented as a graph in which every edge carries a reward of -1.
The shortest path is therefore the path whose (negative) total reward is closest to 0.
Accordingly, I update the policy parameters with gradient descent.
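For example, with hypothetical nodes A, B, C, D (not my actual graph), the return of a path is just -1 times its number of edges:
edge_reward = -1
path = ["A", "B", "C", "D"]                    # 3 edges between hypothetical nodes
path_return = edge_reward * (len(path) - 1)    # -3; a 2-edge path would score -2
# The shortest path is the one whose return is closest to 0.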
Here is the update rule in TensorFlow (1.x):
self.cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=self.outputTrue, logits=self.outputPred)  # per-sample cross entropy, always >= 0
self.cerd = tf.tensordot(self.cross_entropy, self.reward, axes=1)  # reward-weighted sum over the batch
self.meanCEloss = self.cerd / tf.cast(BS, tf.float32)  # BS is the batch size.
self.train_step = tf.train.AdamOptimizer(1e-4).minimize(self.meanCEloss)
However, after running the code, self.meanCEloss keeps decreasing towards negative infinity until underflow occurs.
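A quick check with made-up numbers seems consistent with this behaviour: the per-sample cross entropy is always non-negative, while every reward is negative, so the weighted sum can only become more negative as the cross entropy grows.
import numpy as np
cross_entropy = np.array([0.5, 2.0, 10.0])   # cross entropy values, always >= 0
reward = np.array([-3.0, -2.0, -4.0])        # path rewards, all negative
loss = np.dot(cross_entropy, reward) / 3.0   # divide by batch size of 3
print(loss)                                  # approx. -15.17, and it keeps falling as the cross entropy rises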
What changes are required in the loss evaluation to solve the problem?