Train TensorFlow language model with NCE or sampled softmax

Question

I'm adapting the TensorFlow RNN tutorial to train a language model with an NCE loss or sampled softmax, but I still want to report perplexities. However, the perplexities I get are very weird: for NCE I get several million (terrible!), whereas for sampled softmax I get a PPL of 700 after one epoch (too good to be true?!). I wonder what I'm doing wrong.

Here is my adaptation to the PTBModel:

class PTBModel(object):
  """The PTB model."""

  def __init__(self, is_training, config, loss_function="softmax"):
    ...
    w = tf.get_variable("proj_w", [size, vocab_size])
    w_t = tf.transpose(w)
    b = tf.get_variable("proj_b", [vocab_size])

    if loss_function == "softmax":
      logits = tf.matmul(output, w) + b
      loss = tf.nn.seq2seq.sequence_loss_by_example(
          [logits],
          [tf.reshape(self._targets, [-1])],
          [tf.ones([batch_size * num_steps])])
      self._cost = cost = tf.reduce_sum(loss) / batch_size
    elif loss_function == "nce":
      num_samples = 10
      labels = tf.reshape(self._targets, [-1,1])
      hidden = output
      loss = tf.nn.nce_loss(w_t, b,
                            hidden,
                            labels,
                            num_samples,
                            vocab_size)
    elif loss_function == "sampled_softmax":
      num_samples = 10
      labels = tf.reshape(self._targets, [-1,1])
      hidden = output
      loss = tf.nn.sampled_softmax_loss(w_t, b,
                                        hidden,
                                        labels,
                                        num_samples,
                                        vocab_size)

    self._cost = cost = tf.reduce_sum(loss) / batch_size
    self._final_state = state

I instantiate the models like this:

mtrain = PTBModel(is_training=True, config=config, loss_function="nce")
mvalid = PTBModel(is_training=True, config=config)

I'm not doing anything exotic here; changing the loss function should be pretty straightforward. So why doesn't it work?

Thanks, Joris

score 0 · Answer 1 · answered Jul 14 '16 at 19:23

With the baseline model (full softmax), you should be getting a perplexity well below 700 after one epoch. When you change the loss, you may need to re-tune some of the hyperparameters, in particular the learning rate.

Also, your evaluation model should report true perplexities, which means computing them with a full softmax rather than with the sampled loss. Are you doing that?
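
To make "true perplexity" concrete, here is a minimal pure-Python sketch: perplexity is the exponential of the average per-word negative log-likelihood under a softmax normalized over the whole vocabulary. The function names and the toy logits are illustrative assumptions, not part of the tutorial code. The NCE and sampled-softmax losses are not normalized over the full vocabulary, so exponentiating them does not give a number you can compare against this one.

```python
import math


def log_softmax(logits):
    """Log-probabilities over the full vocabulary (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity(logits_per_step, targets):
    """exp(mean negative log-likelihood of the target words).

    logits_per_step: one list of vocab_size scores per predicted word
    targets: the index of the correct word at each step
    """
    if not targets or len(logits_per_step) != len(targets):
        raise ValueError("need one target per step and at least one step")
    total_nll = 0.0
    for logits, target in zip(logits_per_step, targets):
        total_nll -= log_softmax(logits)[target]
    return math.exp(total_nll / len(targets))


if __name__ == "__main__":
    # Hypothetical toy case: a model that scores all 4 words equally
    # should have a perplexity of about 4 (the vocabulary size).
    uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
    print(perplexity(uniform_logits, [0, 1, 2]))
```

In the setup from the question, this corresponds to training with `loss_function="nce"` or `"sampled_softmax"` but building the validation model with the default `"softmax"` loss, so that both share `proj_w` and `proj_b`.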

Oriol Vinyals

Seems like the sampled softmax does work; it ends up at 129 with 20 negative samples after 13 epochs (the SmallConfig). – niefpaarschoenen Jul 14 '16 at 20:05
NCE, on the other hand, is still failing me. Perplexities (computed with full softmax, as you say) are on the order of millions. Agreed that I need to re-tune, but even without tuning I would expect perplexities to drop a little rather than increase from ~10k to 2M?! – niefpaarschoenen Jul 14 '16 at 20:07
FYI: NCE actually gives reasonable values for a low number of time steps. It starts to go crazy when you increase that number. – niefpaarschoenen Jul 26 '16 at 06:56
@niefpaarschoenen Hi, I'm currently working on this. Did you find a performance improvement using NCE, specifically in terms of words per second? Thx – pltrdy Nov 21 '16 at 15:47
