The Bernoulli distribution is parameterized by the probability that a value equals 1, and, as I understand it, the IndependentBernoulli layer from tensorflow_probability fits these probabilities.
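As a minimal sketch of the kind of setup I mean (the event shape and layer sizes are just placeholders), the layer sits at the end of the network and turns the preceding activations into a Bernoulli distribution over the targets:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfpl = tfp.layers

event_shape = (28, 28, 1)  # placeholder event shape

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=event_shape),
    # Dense layer producing the parameters consumed by IndependentBernoulli
    tf.keras.layers.Dense(tfpl.IndependentBernoulli.params_size(event_shape)),
    # Final layer: an independent Bernoulli distribution per output element
    tfpl.IndependentBernoulli(event_shape),
])

# Trained by maximising the log-likelihood of the observed binary targets
model.compile(optimizer="adam",
              loss=lambda y_true, dist: -dist.log_prob(y_true))
```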
However, if gradient descent pushes these probabilities to values less than or equal to 0, or greater than or equal to 1, then the log_prob method will naturally produce invalid values. I suspect that this is the cause of the NaNs I encounter during training.
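To illustrate what I suspect is happening (this is just my understanding of the failure mode, not a trace from the actual training run), a Bernoulli distribution built with an out-of-range probability already returns NaN from log_prob:

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

# With validate_args=False (the default) an out-of-range probability is not
# rejected; log(1 - p) is undefined for p > 1, so log_prob(0) comes out as NaN.
dist = tfd.Bernoulli(probs=1.5)
print(dist.log_prob(0))  # nan
```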
Therefore, I wonder whether it is possible to constrain the learnt probabilities in the same way you would constrain the kernel of a regular Keras layer (see the sketch below for the kind of mechanism I mean). Any help would be appreciated.
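For comparison, this is the constraint mechanism I have in mind from ordinary Keras layers (the layer and bounds are purely illustrative); I am looking for an equivalent way to keep the fitted probabilities inside (0, 1):

```python
import tensorflow as tf

# On an ordinary Dense layer, the kernel can be constrained directly,
# e.g. keeping the norm of each unit's incoming weight vector within a
# fixed range after every optimizer update.
constrained = tf.keras.layers.Dense(
    units=16,
    kernel_constraint=tf.keras.constraints.MinMaxNorm(min_value=0.0,
                                                      max_value=1.0),
)
```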