I figured out that the issue was that, in my qloss()
function, I was pulling values out of the tensors, doing operations on them, and returning the values. While those values did depend on the tensors, they weren't encapsulated in tensors themselves, so TensorFlow couldn't tell that they depended on the tensors in the graph.
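To see the distinction concretely, here is a tiny standalone illustration (separate from the qloss() code, just a sketch of the general idea in TF 1.x) of how TensorFlow can differentiate through a tensor but not through a value that has been pulled out of one:

import tensorflow as tf

x = tf.constant(3.0)
y_tensor = x * x                  # still a tensor: TF knows it depends on x
with tf.Session() as sess:
    y_value = sess.run(y_tensor)  # a plain value: the graph link is gone
print(tf.gradients(y_tensor, x))              # [<a gradient tensor>]
print(tf.gradients(tf.constant(y_value), x))  # [None] - no dependency on x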
I fixed this by changing qloss()
so that it did operations directly on the tensors and returned a tensor. Here's the new function:
def qloss(actions, rewards, target_Qs, pred_Qs):
"""
Q-function loss with target freezing - the difference between the observed
Q value, taking into account the recently received r (while holding future
Qs at target) and the predicted Q value the agent had for (s, a) at the time
of the update.
Params:
actions - The action for each experience in the minibatch
rewards - The reward for each experience in the minibatch
target_Qs - The target Q value from s' for each experience in the minibatch
pred_Qs - The Q values predicted by the model network
Returns:
A list with the Q-function loss for each experience clipped from [-1, 1]
and squared.
"""
    ys = rewards + DISCOUNT * target_Qs
    # For each row of pred_Qs in the batch, we want the predicted Q for the
    # action taken in that experience, so we build a 2D tensor of indices
    # [experience#, action#] to index into the pred_Qs tensor.
    gather_is = tf.stack([tf.range(BATCH_SIZE), actions], axis=1)
    action_Qs = tf.gather_nd(pred_Qs, gather_is)
    losses = ys - action_Qs
    # Clip the magnitude of each loss to 1, then square.
    clipped_squared_losses = tf.square(tf.minimum(tf.abs(losses), 1))
    return clipped_squared_losses
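With the loss expressed as a tensor, it can be wired into an optimizer like any other node in the graph. Here's a minimal sketch (the reduce_mean and the RMSProp learning rate are illustrative choices, not necessarily what the rest of the project uses):

# Illustrative wiring only; the optimizer and learning rate are assumptions.
loss = tf.reduce_mean(qloss(actions, rewards, target_Qs, pred_Qs))
train_op = tf.train.RMSPropOptimizer(learning_rate=0.00025).minimize(loss)
# Because loss depends on pred_Qs (and therefore on the network's weights),
# minimize() can compute gradients through it.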