My understanding of noise contrastive estimation is that we sample some vectors from our word embeddings (the negative sample), and then calculate the log-likelihood of each. Then we want to maximize the difference between the probability of the target word and the log-likelihood of each of the negative sample words (So if I'm correct about this, we want to optimize the loss function so that it gets as close to 1 as possible).
My question is this:
What is the purpose of the num_classes
parameters to the nce_loss
function? My best guess is that the number of classes is passed in so that Tensorflow knows the size of the distribution from which the negative samples our drawn, but this might not make sense, since we could just infer the size of the distribution from the variable itself. Otherwise, I can't think of a reason for why we would need to know the total possible number of classes, especially if the language model is only outputting k + 1 predictions (negative sample size + 1 for the target word).