
I am training an agent with a policy gradient method. After training, the agent always chooses one of the two actions.

Below is my code:

import tensorflow as tf
import tensorflow_probability as tfp

# pick the action deterministically by thresholding the model output
action = tf.where(self.model(state)[:, -1] > 0.5, 1., 0.)
reward = self.get_rewards(action, state)
with tf.GradientTape() as tape:
    tape.watch(self.model.trainable_weights)
    prob = self.model(state, training=True)
    dist = tfp.distributions.Categorical(probs=prob)
    log_prob = dist.log_prob(action)
    # REINFORCE-style loss: negative reward-weighted log-probability
    loss = -tf.math.reduce_mean(reward * log_prob)
grads = tape.gradient(loss, self.model.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))

Here self.get_rewards(action, state) returns a positive return (properly calculated), and self.model(state) returns the probabilities [p, 1-p] for the two actions.
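To make the shapes concrete, here is a small hand-rolled illustration of what the setup above describes (the numbers and NumPy stand-in are hypothetical, not my actual model): the model emits [p, 1-p] per state, and Categorical.log_prob(action) simply takes the log of the entry indexed by the chosen action.

```python
import numpy as np

# Hypothetical per-state probabilities [p, 1-p] from the model
probs = np.array([[0.7, 0.3],   # state 1
                  [0.2, 0.8]])  # state 2
actions = np.array([1, 1])      # action chosen in each state

# What Categorical(probs=probs).log_prob(actions) computes:
# log of the probability assigned to the chosen action
log_prob = np.log(probs[np.arange(len(actions)), actions])
print(log_prob)  # [log(0.3), log(0.8)]
```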

My guess is that the optimizer pushes p to 0 or 1, since that drives the loss to 0, which is the global minimum: reward is always positive and log_prob is always non-positive, so -reward * log_prob is always non-negative and vanishes only when the chosen action has probability 1.
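A minimal numeric sketch of that argument (using NumPy in place of TF, with an arbitrary positive reward): the per-step loss -reward * log(p) stays non-negative and only approaches its minimum of 0 as the chosen action's probability p goes to 1.

```python
import numpy as np

reward = 2.0                      # any positive return
ps = [0.1, 0.5, 0.9, 0.999999]    # probability assigned to the chosen action
losses = [-reward * np.log(p) for p in ps]
print(losses)  # strictly decreasing toward 0 as p -> 1
```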

Is there any way to fix this problem? I tried an off-policy gradient, but it did not help much, and I am not sure why.

user1292919

0 Answers