
In a seq2seq model with an encoder and a decoder, at each generation step a softmax layer outputs a distribution over the entire vocabulary. In CNTK, a greedy decoder can be implemented easily with the C.hardmax function. It looks like this:

def create_model_greedy(s2smodel):
    # model used in (greedy) decoding (history is decoder's own output)
    @C.Function
    @C.layers.Signature(InputSequence[C.layers.Tensor[input_vocab_dim]])
    def model_greedy(input): # (input*) --> (word_sequence*)
        # Decoding is an unfold() operation starting from sentence_start.
        # We must transform s2smodel (history*, input* -> word_logp*) into a generator (history* -> output*)
        # which holds 'input' in its closure.
        unfold = C.layers.UnfoldFrom(lambda history: s2smodel(history, input) >> C.hardmax,
                                     # stop once sentence_end_index was max-scoring output
                                     until_predicate=lambda w: w[...,sentence_end_index],
                                     length_increase=length_increase)
        return unfold(initial_state=sentence_start, dynamic_axes_like=input)
    return model_greedy

However, at each step I don't want to output the token with the maximum probability. Instead, I want a random decoder, which generates a token according to the probability distribution over the vocabulary.

How can I do that? Any help is appreciated. Thanks.


2 Answers


You can just add noise to the outputs before taking the hardmax. In particular, you can use C.random.gumbel or C.random.gumbel_like to sample proportionally to exp(output): adding independent Gumbel noise to each score and then taking the argmax is equivalent to sampling from the softmax of the scores. This is known as the Gumbel-max trick. The cntk.random module contains other distributions as well, but if you have log probabilities you most likely want to add Gumbel noise before the hardmax. Some code:

@C.Function
def randomized_hardmax(x):
    # add independent Gumbel noise; the argmax of the noisy scores
    # is a sample from softmax(x)
    noisy_x = x + C.random.gumbel_like(x)
    return C.hardmax(noisy_x)
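
As a sanity check (not part of the answer's CNTK code; a small NumPy sketch under the assumption that the scores are logits), this shows that taking the argmax after adding Gumbel noise picks each index with probability approximately equal to its softmax probability:

import numpy as np

logits = np.array([1.0, 2.0, 0.5])
softmax = np.exp(logits) / np.exp(logits).sum()

rng = np.random.default_rng(0)
counts = np.zeros_like(logits)
for _ in range(100000):
    # Gumbel(0, 1) noise via inverse transform sampling
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    counts[np.argmax(logits + gumbel)] += 1

print(softmax)                 # target probabilities
print(counts / counts.sum())   # empirical frequencies, approximately equal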

Then replace your hardmax with randomized_hardmax, as sketched below.
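
For concreteness, here is a minimal sketch of the question's greedy decoder with C.hardmax swapped for randomized_hardmax (the names create_model_random and model_random are placeholders; InputSequence, input_vocab_dim, sentence_start, sentence_end_index and length_increase are assumed to be defined as in the question):

def create_model_random(s2smodel):
    # same structure as create_model_greedy, but each step samples from the
    # predicted distribution instead of taking the argmax
    @C.Function
    @C.layers.Signature(InputSequence[C.layers.Tensor[input_vocab_dim]])
    def model_random(input): # (input*) --> (word_sequence*)
        unfold = C.layers.UnfoldFrom(lambda history: s2smodel(history, input) >> randomized_hardmax,
                                     # stop once sentence_end_index was the sampled output
                                     until_predicate=lambda w: w[...,sentence_end_index],
                                     length_increase=length_increase)
        return unfold(initial_state=sentence_start, dynamic_axes_like=input)
    return model_random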

  • I don't have 15 reputation on this new account... I am in China now and I cannot log onto my gmail account or use the facebook account. I will upvote your answer as soon as I get back to US. Thank you again. – meijiesky Aug 18 '17 at 05:40

Many thanks to Nikos Karampatziakis.

The following code works if you want a stochastic sampling decoder that generates a sequence of the same length as your target sequence.

@C.Function
def sampling(x):
    # Gumbel-max trick: the argmax of (x + Gumbel noise) is a sample from softmax(x)
    noisy_x = x + C.random.gumbel_like(x)
    return C.hardmax(noisy_x)

def create_model_sampling(s2smodel):
    @C.Function
    @C.layers.Signature(input=InputSequence[C.layers.Tensor[input_vocab_dim]],
                        labels=LabelSequence[C.layers.Tensor[label_vocab_dim]])
    def model_sampling(input, labels): # (input*, labels*) --> (word_sequence*)
        # length_increase=1 together with dynamic_axes_like=labels makes the
        # generated sequence as long as the target sequence
        unfold = C.layers.UnfoldFrom(lambda history: s2smodel(history, input) >> sampling,
                                     length_increase=1)
        return unfold(initial_state=sentence_start, dynamic_axes_like=labels)
    return model_sampling