I am training an agent with the policy gradient method. After training, the agent always ends up deterministically picking one of the two actions.
Below is my code:
import tensorflow as tf
import tensorflow_probability as tfp

# Pick action 1 deterministically whenever its predicted probability exceeds 0.5
action = tf.where(self.model(state)[:, -1] > 0.5, 1., 0.)
reward = self.get_rewards(action, state)
with tf.GradientTape() as tape:
    tape.watch(self.model.trainable_weights)
    prob = self.model(state, training=True)
    dist = tfp.distributions.Categorical(probs=prob)
    log_prob = dist.log_prob(action)
    # REINFORCE-style loss: reward-weighted negative log-probability
    loss = -tf.math.reduce_mean(reward * log_prob)
grads = tape.gradient(loss, self.model.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
Here self.get_rewards(action, state) returns a (properly calculated) positive return, and self.model(state) returns the action probabilities [p, 1 - p].
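For reference, my model ends in a two-unit softmax head along these lines (exact architecture omitted; this is just a minimal stand-in):

import tensorflow as tf

# Minimal stand-in for self.model: some state encoder followed by a
# two-unit softmax, so each state maps to probabilities [p, 1 - p].
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),  # placeholder encoder
    tf.keras.layers.Dense(2, activation="softmax"),
])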
My guess is that the optimal choice is p = 0 or p = 1, since either drives the loss to 0, which is always the minimum. It is the minimum because reward is always positive and log_prob is always non-positive, so -reward * log_prob is always non-negative, with equality exactly when the taken action has probability 1.
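A quick numeric check of this (independent of the training code): for a fixed positive reward r, the per-sample loss -r * log(p) of the taken action vanishes as its probability p approaches 1.

import numpy as np

r = 2.0  # any positive reward
for p in (0.5, 0.9, 0.99, 0.999):
    print(f"p={p}: loss={-r * np.log(p):.4f}")  # tends to 0 as p -> 1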
Is there any way to fix this problem? I tried an off-policy gradient instead, but it did not help much, and I am not sure why.