
The Bernoulli distribution is parameterized by the probability that a value equals 1. The IndependentBernoulli layer from tensorflow_probability fits these probabilities (as I understand it). However, if gradient descent drives the probabilities to values less than or equal to 0, or greater than or equal to 1, then the log_prob method will naturally produce invalid values. I suspect this is the cause of the NaNs I encounter during training.
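
Here is a minimal sketch of the failure mode I have in mind (the value 1.2 is just an arbitrary out-of-range probability):

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

# An out-of-range probability makes log_prob undefined:
# log(1 - 1.2) does not exist, so the result is nan.
dist = tfd.Bernoulli(probs=1.2)
print(dist.log_prob(0))  # tf.Tensor(nan, ...)
```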

Therefore, I wonder whether it is possible to constrain the learnt probabilities to the valid range, in the same way one can constrain the kernel of a regular Keras layer (e.g. via kernel_constraint). Any help would be appreciated.
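
To illustrate what I mean, below is a sketch contrasting a kernel constraint on a regular Dense layer with my guess at an analogous fix: clipping the parameters into (0, 1) with a Lambda layer before they reach IndependentBernoulli. The event shape, input shape, and the epsilon of 1e-6 are arbitrary placeholders, and the clipping only makes sense if the layer really does interpret its inputs as probabilities rather than, say, logits:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfk = tf.keras
tfpl = tfp.layers

event_shape = (10,)  # hypothetical event shape, for illustration only

model = tfk.Sequential([
    # On a regular layer, the weights can be constrained like this:
    tfk.layers.Dense(
        tfpl.IndependentBernoulli.params_size(event_shape),
        kernel_constraint=tfk.constraints.NonNeg(),
        input_shape=(5,)),  # hypothetical input shape
    # My guess at the analogue for the distribution parameters:
    # clip them into (0, 1) before the distribution layer.
    tfk.layers.Lambda(lambda t: tf.clip_by_value(t, 1e-6, 1. - 1e-6)),
    tfpl.IndependentBernoulli(event_shape),
])
```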
