I'm trying to implement a Seq2Seq model with attention in CNTK, something very similar to CNTK Tutorial 204. However, several small differences lead to various issues and error messages, which I don't understand. There are many questions here, which are probably interconnected and all stem from some single thing I don't understand.
Note (in case it's important): my input data comes from MinibatchSourceFromData, created from NumPy arrays that fit in RAM; I don't store it in a CTF file.
ins = C.sequence.input_variable(input_dim, name="in", sequence_axis=inAxis)
y = C.sequence.input_variable(label_dim, name="y", sequence_axis=outAxis)
Thus, the shapes are [#, *](input_dim) and [#, *](label_dim).
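Just to be explicit about what CNTK reports for these two variables, this is the quick check I run (nothing new here, only the two inputs declared above):

# Print the static shape and dynamic axes of each input; I expect the batch
# axis plus the corresponding sequence axis (inAxis / outAxis).
for v in (ins, y):
    print(v.name, v.shape, [ax.name for ax in v.dynamic_axes])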
Question 1: When I run the CNTK 204 Tutorial and dump its graph into a .dot file using cntk.logging.plot, I see that its input shapes are [#](-2,). How is this possible?

- Where did the sequence axis (*) disappear?
- How can a dimension be negative?
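For reference, this is roughly how I produce the .dot file (a sketch; model204 is just a placeholder name I'm using here for the tutorial's root Function):

# Dump the graph as GraphViz text; the input shapes appear as [#](-2,) in the
# resulting file. `model204` is assumed to be the tutorial's model Function.
import cntk as C

C.logging.plot(model204, filename="model204.dot")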
Question 2: In the same tutorial, we have attention_axis = -3. I don't understand this. In my model there are 2 dynamic axes and 1 static axis, so the "third to last" axis would be #, the batch axis, but attention definitely shouldn't be computed over the batch axis. I hoped that looking at the actual axes in the tutorial code would help me understand this, but the [#](-2,) issue above made it even more confusing.
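For context, this is roughly how the attention layer enters my decoder (a sketch with my hyperparameters; only attention_axis changes in the experiments below):

# AttentionModel as used in Tutorial 204, with my attention_dim and attention_span;
# attention_axis is the value I am trying to understand (-3 in the tutorial).
from cntk.layers import AttentionModel

attention = AttentionModel(attention_dim=64,
                           attention_span=200,
                           attention_axis=-3,
                           name='attention_model')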
Setting attention_axis to -2 gives the following error:

RuntimeError: Times: The left operand 'Placeholder('stab_result', [#, outAxis], [128])' rank (1) must be >= #axes (2) being reduced over.

It occurs during creation of the training-time model:
def train_model(m):
    @C.Function
    def model(ins: InputSequence[Tensor[input_dim]],
              labels: OutputSequence[Tensor[label_dim]]):
        past_labels = Delay(initial_state=C.Constant(seq_start_encoding))(labels)
        return m(ins, past_labels)  # <<<<<<<<<<<<<< HERE
    return model
where stab_result is a Stabilizer right before the final Dense layer in the decoder. I can see in the dot file that spurious trailing dimensions of size 1 appear in the middle of the AttentionModel implementation.
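For reference, the tail of my decoder looks roughly like this (a sketch, not my exact code; the variable names are mine and label_dim is my output dimension):

# The Stabilizer whose output shows up as the 'stab_result' placeholder in the
# error above, followed by the decoder's final Dense projection.
from cntk.layers import Stabilizer, Dense

stab = Stabilizer(name='stab_result')
out_proj = Dense(label_dim, name='out_proj')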
Setting attention_axis to -1 gives the following error:

RuntimeError: Binary elementwise operation ElementTimes: Left operand 'Output('Block346442_Output_0', [#, outAxis], [64])' shape '[64]' is not compatible with right operand 'Output('attention_weights', [#, outAxis], [200])' shape '[200]'.
where 64 is my attention_dim and 200 is my attention_span. As I understand it, the elementwise * inside the attention model definitely shouldn't be conflating these two, so -1 is clearly not the right axis here.
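Just to illustrate the kind of mismatch the error complains about, here is a toy repro with two made-up inputs of those sizes (nothing from my model, only the shapes 64 and 200 taken from the error message):

# Two plain inputs with static shapes (64,) and (200,), standing in for the two
# operands in the error; multiplying them elementwise hits the same
# "'[64]' is not compatible with '[200]'" complaint at graph-construction time.
import cntk as C

a = C.input_variable(64)   # stands in for the attention_dim-sized operand
b = C.input_variable(200)  # stands in for the attention_span-sized operand
c = a * b                  # RuntimeError: Binary elementwise operation ElementTimes ...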
Question 3: Is my understanding above correct? What should be the right axis and why is it causing one of the two exceptions above?
Thanks for the explanations!