I am having trouble computing the actor update of the DDPG algorithm using TensorFlow 2. The following is the code for both the critic and the actor updates:
```python
with tf.GradientTape() as tape:  # persistent=True
    # compute current action values
    current_q = self.Q_model(stat_act.astype('float32'))
    # compute target action values
    action_next = TargetNet.p_model(states_next.astype('float32'))
    stat_act_next = np.concatenate((states_next, action_next), axis=1)
    target_q = TargetNet.Q_model(stat_act_next.astype('float32'))
    target_values = rewards + self.gamma * target_q
    loss_q = self.loss(y_true=target_values, y_pred=current_q)
variables_q = self.Q_model.trainable_variables
gradients_q = tape.gradient(loss_q, variables_q)
self.optimizer.apply_gradients(zip(gradients_q, variables_q))

with tf.GradientTape() as tape:
    current_actions = self.p_model(states.astype('float32'))
    current_q_pg = self.Q_model(np.concatenate((states.astype('float32'),
                                                 current_actions),
                                                axis=1))
    loss_p = -tf.math.reduce_mean(current_q_pg)
variables_p = self.p_model.trainable_variables
gradients_p = tape.gradient(loss_p, variables_p)
self.optimizer.apply_gradients(zip(gradients_p, variables_p))
```
These updates are part of a class method, and the actor and critic networks are defined separately. The issue is that `gradients_p` is returned as a list of `None` values, and I can't see what is wrong in this piece of code. I am aware that I could split the computation of the policy gradient according to the chain rule, but I don't know how to compute the derivative of the critic output with respect to the action input using `tf.GradientTape`.
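For example, I imagined something along the following lines for the dQ/da part, with toy networks just to illustrate what I mean (`tape_a`, `dq_da`, and the toy models are placeholders, and I am not sure whether explicitly watching the action tensor like this is the right approach):

```python
import numpy as np
import tensorflow as tf

# toy stand-ins for the real critic and actor, only to illustrate the question
state_dim, action_dim, batch = 3, 2, 4
Q_model = tf.keras.Sequential([tf.keras.layers.Dense(1)])           # critic: Q(s, a)
p_model = tf.keras.Sequential([tf.keras.layers.Dense(action_dim)])  # actor: pi(s)

states = np.random.randn(batch, state_dim).astype('float32')

actions = p_model(states)                     # actions come out as a tf.Tensor
with tf.GradientTape() as tape_a:
    tape_a.watch(actions)                     # watch the (non-variable) action tensor
    q_values = Q_model(tf.concat([states, actions], axis=1))
dq_da = tape_a.gradient(q_values, actions)    # dQ/da, shape (batch, action_dim)
```

If that is right, I assume I would then have to push `dq_da` back through the actor with a second tape, but I would prefer the single-pass version above if it can be made to work.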
How can I implement this part correctly? I also don't understand why `tf.GradientTape` is not able to trace back to the trainable variables of the actor network and perform the whole computation in a single pass.