I am in a reinforcement learning setting where my environment's action space depends on the state. As a result, I go through the following procedure when sampling behavior actions (sketched in code after the list):
(1) generate probability logits for all possible actions
(2) compute softmax over these logits
(3) mask the actions that are not valid in this state (by multiplying by a vector of zeros and ones), which zeros out some of the probabilities
(4) renormalize the valid action probabilities
(5) sample from this distribution
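For concreteness, here is a minimal sketch of that sampling path in TensorFlow. The toy values and names (valid_mask, etc.) are illustrative placeholders, not my actual code:

import tensorflow as tf

logits = tf.constant([[1.0, 2.0, 0.5, -1.0]])     # (1) probability logits, shape [batch, num_actions]
valid_mask = tf.constant([[1.0, 0.0, 1.0, 1.0]])  # 1 = action valid in this state, 0 = invalid
probs = tf.nn.softmax(logits)                     # (2) softmax over all actions
masked = probs * valid_mask                       # (3) zero out invalid actions
renorm = masked / tf.reduce_sum(masked, axis=-1, keepdims=True)      # (4) renormalize over valid actions
action = tf.random.categorical(tf.math.log(renorm), num_samples=1)   # (5) sample; log(0) = -inf gets zero mass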
This works perfectly well for generating actions. However, I run into issues when I need to calculate the policy gradient update. Typically one does the following:
tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=A)
where logits are the probability logits from step (1) and A is the sampled action. But since I do the masking/renormalization after the softmax, the snippet above no longer computes the correct cross entropy for my policy (i.e. the negative log-probability of A under the renormalized, masked distribution). I am wondering if there is a graceful way to handle this situation. My understanding is that one should always use TensorFlow's built-in cross-entropy ops for numerical stability, but I am unsure how to correctly incorporate the masking/renormalization into them.
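For reference, the "by hand" loss that would match my sampling distribution (continuing the toy names from the sketch above; A here is a hypothetical sampled action index of shape [batch]) looks like the following, and the explicit log is exactly the part I suspect is numerically fragile:

A = tf.constant([2])                                              # hypothetical sampled action
action_prob = tf.reduce_sum(renorm * tf.one_hot(A, 4), axis=-1)   # prob of the sampled action (4 = num_actions in the toy example)
neg_log_prob = -tf.math.log(action_prob)                          # what I would feed into the policy-gradient loss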