I'm trying to implement a Seq2Seq model with attention in CNTK, something very similar to CNTK Tutorial 204. However, several small differences lead to various issues and error messages, which I don't understand. There are many questions here, which are probably interconnected and all stem from some single thing I don't understand.
Note (in case it's important): my input data comes from MinibatchSourceFromData, created from NumPy arrays that fit in RAM; I don't store it in a CTF file.
ins = C.sequence.input_variable(input_dim, name="in", sequence_axis=inAxis)
y = C.sequence.input_variable(label_dim, name="y", sequence_axis=outAxis)
Thus, the shapes are [#, *](input_dim) and [#, *](label_dim).
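Just to be explicit about what CNTK reports for these two variables, this is the quick check I run (nothing new here, only the two inputs declared above):

# Print the static shape and dynamic axes of each input; I expect the batch
# axis plus the corresponding sequence axis (inAxis / outAxis).
for v in (ins, y):
    print(v.name, v.shape, [ax.name for ax in v.dynamic_axes])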
Question 1: When I run the CNTK 204 Tutorial and dump its graph into a .dot file using cntk.logging.plot, I see that its input shapes are [#](-2,). How is this possible?

- Where did the sequence axis (*) disappear?
- How can a dimension be negative?
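For reference, this is roughly how I produce the .dot file (a sketch; model204 is just a placeholder name I'm using here for the tutorial's root Function):

# Dump the graph as GraphViz text; the input shapes appear as [#](-2,) in the
# resulting file. `model204` is assumed to be the tutorial's model Function.
import cntk as C

C.logging.plot(model204, filename="model204.dot")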
Question 2: In the same tutorial, we have attention_axis = -3. I don't understand this. In my model there are 2 dynamic axes and 1 static axis, so the "third to last" axis would be #, the batch axis, but attention definitely shouldn't be computed over the batch axis. I hoped that looking at the actual axes in the tutorial code would help me understand this, but the [#](-2,) issue above made it even more confusing.
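For context, this is roughly how the attention layer enters my decoder (a sketch with my hyperparameters; only attention_axis changes in the experiments below):

# AttentionModel as used in Tutorial 204, with my attention_dim and attention_span;
# attention_axis is the value I am trying to understand (-3 in the tutorial).
from cntk.layers import AttentionModel

attention = AttentionModel(attention_dim=64,
                           attention_span=200,
                           attention_axis=-3,
                           name='attention_model')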
Setting attention_axis to -2 gives the following error:

RuntimeError: Times: The left operand 'Placeholder('stab_result', [#, outAxis], [128])' rank (1) must be >= #axes (2) being reduced over.

It occurs during creation of the training-time model:
def train_model(m):
    @C.Function
    def model(ins: InputSequence[Tensor[input_dim]],
              labels: OutputSequence[Tensor[label_dim]]):
        past_labels = Delay(initial_state=C.Constant(seq_start_encoding))(labels)
        return m(ins, past_labels)  # <<<<<<<<<<<<<< HERE
    return model
where stab_result is a Stabilizer right before the final Dense layer in the decoder. I can see in the dot file that spurious trailing dimensions of size 1 appear in the middle of the AttentionModel implementation.
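For reference, the tail of my decoder looks roughly like this (a sketch, not my exact code; the variable names are mine and label_dim is my output dimension):

# The Stabilizer whose output shows up as the 'stab_result' placeholder in the
# error above, followed by the decoder's final Dense projection.
from cntk.layers import Stabilizer, Dense

stab = Stabilizer(name='stab_result')
out_proj = Dense(label_dim, name='out_proj')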
Setting attention_axis to -1 gives the following error:

RuntimeError: Binary elementwise operation ElementTimes: Left operand 'Output('Block346442_Output_0', [#, outAxis], [64])' shape '[64]' is not compatible with right operand 'Output('attention_weights', [#, outAxis], [200])' shape '[200]'.
where 64 is my attention_dim and 200 is my attention_span. As I understand it, the elementwise * inside the attention model definitely shouldn't be conflating these two, so -1 is clearly not the right axis here.
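Just to illustrate the kind of mismatch the error complains about, here is a toy repro with two made-up inputs of those sizes (nothing from my model, only the shapes 64 and 200 taken from the error message):

# Two plain inputs with static shapes (64,) and (200,), standing in for the two
# operands in the error; multiplying them elementwise hits the same
# "'[64]' is not compatible with '[200]'" complaint at graph-construction time.
import cntk as C

a = C.input_variable(64)   # stands in for the attention_dim-sized operand
b = C.input_variable(200)  # stands in for the attention_span-sized operand
c = a * b                  # RuntimeError: Binary elementwise operation ElementTimes ...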
Question 3: Is my understanding above correct? What should be the right axis and why is it causing one of the two exceptions above?
Thanks for the explanations!