
I am stuck with a supposedly simple problem. I have a mixture of experts system, consisting of multiple neural networks for classification, whose mixture weights are also determined by the data. For the posterior probability of label y, given data x and K different expert networks, we have:

p(y|x) = sum_{k=1}^{K} p(z_k|x) * p(y|z_k, x)

In this scheme, p(y|z_k,x) are the expert network posterior probabilities, i.e. Softmax functions applied to the expert outputs, and p(z_k|x) are the mixture weights produced by a gating network from the data.
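
To make the setup concrete, here is a minimal sketch of the forward pass (the class and variable names are my own placeholders, and the experts and the gate are reduced to single linear layers for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    """Illustrative sketch only; the real experts are full networks."""
    def __init__(self, in_dim, num_classes, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, num_classes) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        # p(y|z_k, x): softmax over each expert's logits -> (B, K, C)
        expert_probs = torch.stack(
            [F.softmax(expert(x), dim=-1) for expert in self.experts], dim=1
        )
        # p(z_k|x): softmax over the gating logits -> (B, K, 1)
        gate_probs = F.softmax(self.gate(x), dim=-1).unsqueeze(-1)
        # p(y|x) = sum_k p(z_k|x) * p(y|z_k, x) -> (B, C)
        return (gate_probs * expert_probs).sum(dim=1)
```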

My problem is the following. Usually, in PyTorch we feed the outputs of the last layer (logits) into the cross entropy loss function, and PyTorch handles the numerical stability internally with the Logsumexp trick (How is log_softmax() implemented to compute its value (and gradient) with better speed and numerical stability?). In my case, however, due to the nature of the mixture model the model output is not logits but the probabilities themselves. Taking the logarithm of the mixture probabilities and feeding it into the NLL loss crashes after a couple of iterations, since some probabilities quickly get very close to 0 and underflow/overflow issues start to appear. The calculation is numerically very unstable.
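
For reference, this is roughly what I do at the moment (a minimal sketch; `model`, `x` and `labels` are placeholders for my actual module, input batch and integer class targets):

```python
import torch
import torch.nn.functional as F

mixture_probs = model(x)              # p(y|x), shape (B, C), already probabilities
log_probs = torch.log(mixture_probs)  # becomes -inf once a probability underflows to 0
loss = F.nll_loss(log_probs, labels)  # blows up / turns NaN after a few iterations
```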

In this particular case, what would be the correct way to calculate CE (or NLL) loss, without losing the numerical stability?

Ufuk Can Bicici
