Sampled softmax loss over variable sequence batches?

Question

Background info: I'm working on sequence-to-sequence models, and right now my model accepts variable-length input tensors (not lists) with input shapes corresponding to [batch size, sequence length]. In my implementation, however, sequence length is left unspecified (set to None) to allow for variable-length inputs. Specifically, input sequence batches are padded only to the length of the longest sequence in that batch. This has sped up my training considerably, so I'd prefer to keep it this way rather than going back to bucketed models and/or padding all sequences in the training data to the same length. I'm using TensorFlow 1.0.0.

Problem: I'm currently using the following to compute the loss (which runs just fine).

loss = tf.losses.sparse_softmax_cross_entropy(
    labels=target_labels,  # shape: [batch size, None]
    logits=outputs[:, :-1, :],  # shape: [batch size, None, vocab size]
    weights=target_weights[:, :-1])  # shape: [batch size, None]

where vocab size is typically about 40,000. I'd like to use a sampled softmax, but I've run into an issue caused by the unspecified sequence-length dimension of the input shape. According to the documentation for tf.nn.sampled_softmax_loss, it requires the inputs to be fed separately for each timestep. However, I can't call, for example,

tf.unstack(target_labels, axis=1)

since the size of that axis is unknown beforehand. Does anyone know how I might go about implementing this? One would assume that, since both dynamic_rnn and tf.losses.sparse_softmax_cross_entropy seem to have no issue handling this, a workaround could be implemented for the sampled softmax loss somehow. After digging around in the source code and even the models repository, I've come up empty-handed. Any help/suggestions would be greatly appreciated.
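One possible direction, sketched here in plain NumPy rather than TensorFlow: a loss that expects one row per example does not need the time axis to be unstacked if the batch and time axes are merged first. The function name `masked_xent_flat` and its argument names are hypothetical, chosen only for illustration. In TensorFlow the analogous step would presumably be `tf.reshape(x, [-1, dim])`, which tolerates an unknown sequence length because `-1` is resolved at run time; that mapping is an assumption and is not tested here.

```python
import numpy as np


def masked_xent_flat(logits, labels, weights):
    """Masked softmax cross-entropy over [batch, time] without unstacking.

    logits:  float array, shape [batch, time, vocab]
    labels:  int array,   shape [batch, time]
    weights: float array, shape [batch, time], 0.0 at padded positions
    """
    vocab = logits.shape[-1]

    # Merge the batch and time axes. The -1 lets the (variable) number of
    # positions be inferred at run time, so no static length is needed.
    flat_logits = logits.reshape(-1, vocab)  # [batch * time, vocab]
    flat_labels = labels.reshape(-1)         # [batch * time]
    flat_weights = weights.reshape(-1)       # [batch * time]

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = flat_logits - flat_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the target token at every position.
    per_position = -log_probs[np.arange(flat_labels.size), flat_labels]

    # Average over real (non-padded) positions only.
    return (per_position * flat_weights).sum() / max(flat_weights.sum(), 1.0)
```

The same flattened rows and labels could then be handed to any per-example loss, with the weights applied afterwards to zero out the padded positions.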

So the reason you want to unstack it is to choose different negative samples for each time step. Would it work to use different negative samples across the batch dimension instead? This might work if you always have the same batch size for training. - Aaron, Mar 13 '17 at 02:29
I agree with Aaron. It might work to do the negative sampling manually: prepare the true_logits and sampled_logits and calculate the loss yourself. See the source code of `sampled_softmax_loss` or `nce_loss` for how the negative sampling is done. BTW, padding is really annoying in TensorFlow RNNs; TensorFlow Fold or PyTorch may be better choices. - Jie.Zhou, Mar 13 '17 at 02:47
Interesting, I hadn't thought of that. Now that I think of it, wouldn't that be the more correct way to do the negative sampling? From what I understand, the algorithm (from Cho et al., 2015) is defined, *for a given sequence*, over a subset of the output distribution. I don't see how you could accurately implement the algorithm without having the previous output tokens. Guess it's time to dig into the source code. Also, why do you say padding is annoying in TF as opposed to the others? Thanks for the suggestions. - Brandon McKinzie, Mar 13 '17 at 03:24
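The manual approach suggested in the comments might look roughly like the NumPy sketch below: one set of negative classes is drawn for the whole batch, `true_logits` and `sampled_logits` are built for every flattened position, and a softmax cross-entropy is taken with the true class in column 0. Everything here is an illustrative assumption, not the TensorFlow implementation: the function name, the argument names, and in particular the uniform sampler. With uniform sampling the usual log-probability correction is the same constant for every class and cancels inside the softmax, so it is omitted; a non-uniform sampler would need it.

```python
import numpy as np


def sampled_softmax_shared_negatives(hidden, labels, mask, out_w, out_b,
                                     num_sampled, rng):
    """Sampled softmax with one set of negatives shared by all positions.

    hidden: [batch, time, dim] decoder outputs before the projection
    labels: [batch, time] int target ids
    mask:   [batch, time] float, 0.0 at padded positions
    out_w:  [vocab, dim] output projection weights
    out_b:  [vocab] output projection biases
    """
    dim = hidden.shape[-1]
    flat_h = hidden.reshape(-1, dim)  # [N, dim], N = batch * time
    flat_y = labels.reshape(-1)       # [N]
    flat_m = mask.reshape(-1)         # [N]

    # Assumption: uniform sampling without replacement, shared by the batch.
    sampled = rng.choice(out_w.shape[0], size=num_sampled, replace=False)

    # Logit of the true class at each position.
    true_logits = (flat_h * out_w[flat_y]).sum(axis=1) + out_b[flat_y]  # [N]
    # Logits of the sampled classes at each position.
    sampled_logits = flat_h @ out_w[sampled].T + out_b[sampled]         # [N, S]

    # Suppress "accidental hits" where a sampled id equals the true label.
    hits = sampled[None, :] == flat_y[:, None]
    sampled_logits = np.where(hits, -1e9, sampled_logits)

    # Column 0 holds the true class; the remaining columns are negatives.
    logits = np.concatenate([true_logits[:, None], sampled_logits], axis=1)
    shifted = logits - logits.max(axis=1, keepdims=True)
    per_position = np.log(np.exp(shifted).sum(axis=1)) - shifted[:, 0]

    # Average over real (non-padded) positions only.
    return (per_position * flat_m).sum() / max(flat_m.sum(), 1.0)
```

Because the negatives are shared, nothing in this sketch depends on the sequence length being known ahead of time; only the flattened number of positions matters, and that is available at run time.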

