
In the Keras example Timeseries classification with a Transformer model, the author builds a transformer encoder like this:

def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(x, x)
    ...

Later the encoder is called with head_size=256 (so key_dim=256), even though the input shape is (None, 500, 1).

Why can the input sequence length be 500 while the key dimension is 256? It seems they should be the same, since the input sequence is the key sequence.
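
For what it's worth, here is a minimal standalone check (my own sketch, assuming TensorFlow 2.x / tf.keras, not taken from the tutorial) confirming that the layer accepts key_dim=256 on a 500-step input and still returns an output with sequence length 500:

    import tensorflow as tf

    # dummy batch of 8 series, 500 timesteps, 1 feature each
    x = tf.random.normal((8, 500, 1))

    # key_dim is the per-head projection size for queries/keys,
    # not the sequence length
    mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=256)
    y = mha(x, x)

    print(y.shape)  # (8, 500, 1) -- same sequence length as the input

So the layer clearly allows it; I just don't understand why key_dim is independent of the sequence length.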


0 Answers