
My understanding of noise contrastive estimation is that we sample some vectors from our word embeddings (the negative samples) and then compute the log-likelihood of each. We then want to maximize the difference between the probability of the target word and the log-likelihoods of the negative-sample words (so, if I'm correct about this, we want to optimize the loss function so that it gets as close to 1 as possible).

My question is this:

What is the purpose of the num_classes parameter to the nce_loss function? My best guess is that the number of classes is passed in so that TensorFlow knows the size of the distribution from which the negative samples are drawn, but this might not make sense, since we could just infer the size of the distribution from the variable itself. Otherwise, I can't think of a reason why we would need to know the total possible number of classes, especially if the language model only outputs k + 1 predictions (negative sample size + 1 for the target word).
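For concreteness, the k + 1-term objective described above can be sketched in plain Python. This is only a toy sketch of the sigmoid-based NCE form; the score values and the helper names are illustrative assumptions, not TensorFlow's actual implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def toy_nce_loss(pos_score, neg_scores):
    """Toy NCE-style loss: push the target word's score up and the
    k negative samples' scores down (k + 1 terms in total)."""
    loss = -math.log(sigmoid(pos_score))      # target word term
    for s in neg_scores:                      # one term per negative sample
        loss += -math.log(sigmoid(-s))
    return loss

# The loss shrinks as the target outscores the negatives.
print(toy_nce_loss(2.0, [-1.0, -0.5]))  # well-separated: small loss
print(toy_nce_loss(0.0, [0.0, 0.0]))    # indistinguishable: larger loss
```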

Aj Langley

1 Answer


Your guess is correct. The num_classes argument is used to sample negative labels from the log-uniform (Zipfian) distribution.

Here's the relevant excerpt from the source code:

# Sample the negative labels.
#   sampled shape: [num_sampled] tensor
#   true_expected_count shape = [batch_size, 1] tensor
#   sampled_expected_count shape = [num_sampled] tensor
if sampled_values is None:
  sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
      true_classes=labels,
      num_true=num_true,
      num_sampled=num_sampled,
      unique=True,
      range_max=num_classes)

The range_max=num_classes argument defines the support of this distribution and hence the range of the sampled values, [0, range_max). Note that this range can't be reliably inferred from the labels, because a particular mini-batch may contain only small word ids, which would skew the distribution significantly.
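To see how range_max shapes the distribution, the log-uniform (Zipfian) sampler assigns each class the documented probability P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1). A plain-Python sketch (not the TensorFlow op itself) of that density:

```python
import math

def log_uniform_prob(class_id, range_max):
    """Sampling probability of `class_id` under the log-uniform (Zipfian)
    candidate sampler with the given range_max, per the TF docs."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

vocab = 10000  # plays the role of num_classes / range_max
probs = [log_uniform_prob(i, vocab) for i in range(vocab)]

print(sum(probs))             # telescopes to exactly 1 over [0, range_max)
print(probs[0] > probs[100])  # small (frequent) word ids are sampled more often
```

Because the sum telescopes to log(range_max + 1) / log(range_max + 1), changing range_max rescales every class's probability, which is why it must be the full vocabulary size rather than whatever ids happen to appear in a batch.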

Maxim