My two cents,
First, an intuitive understanding of the transformer encoder: given (batch, horizon, features), the attention mechanism tries to find a weighted linear combination of the projected features. The weights are learned via attention scores, obtained by operating between features over each horizon step. The FFN layer that comes next is then a linear combination of values within the feature axis.
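To make that intuition concrete, here is a rough einsum sketch of those two steps. This is my own illustration, not part of the question's code: the shapes and variable names are made up, and the Q/K/V projections of a real encoder are omitted.

import tensorflow as tf

batch, horizon, features, d_ff = 1, 2, 3, 8
x = tf.random.uniform((batch, horizon, features))

# attention scores between horizon positions, from dot products of feature vectors
scores = tf.nn.softmax(tf.einsum('shf,stf->sht', x, x), axis=-1)  # (batch, horizon, horizon)
# weighted linear combination of the (projected) features over the horizon
attended = tf.einsum('sht,stf->shf', scores, x)                   # (batch, horizon, features)
# FFN: a linear combination of values within the feature axis
w_ffn = tf.random.uniform((features, d_ff))
ffn_out = tf.einsum('shf,fz->shz', attended, w_ffn)               # (batch, horizon, d_ff)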
Coming to EinsumDense, by way of example we have two tensors:

a: Data (your input tensor to EinsumDense)
b: Weights (EinsumDense's internal weights tensor)
# create random data in a 3D tensor
a = tf.random.uniform(minval=1, maxval=3, shape=(1,2,3), dtype=tf.int32)
# [[[1, 2, 2],
# [2, 2, 1]]]
shf,h->shf: This just scales the features at each horizon step by a single per-horizon weight.
b = tf.random.uniform(minval=2, maxval=4, shape=(2,), dtype=tf.int32)
# [3, 2]
tf.einsum('shf,h->shf', a, b)
# [[[3, 6, 6],  # features at the 1st horizon step are scaled by 3
#   [4, 4, 2]]] # features at the 2nd horizon step are scaled by 2
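A quick way to check this reading (my addition, reusing the a and b defined above): the einsum is just broadcasting b over the batch and feature axes.

# same result via broadcasting: one scalar per horizon step
scaled = a * b[tf.newaxis, :, tf.newaxis]
tf.debugging.assert_equal(tf.einsum('shf,h->shf', a, b), scaled)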
shf,hz->shz: This does a linear combination within features (f is summed out at each horizon step) and maps each step to z output units.
b = tf.random.uniform(minval=2, maxval=4, shape=(2,6), dtype=tf.int32)
# [[3, 3, 3, 3, 3, 3],
# [2, 2, 2, 3, 2, 3]]
tf.einsum('shf,hz->shz', a, b)
# [[[15, 15, 15, 15, 15, 15],
# [10, 10, 10, 15, 10, 15]]]
# every value in the first output row combines the features at the 1st horizon step, [1, 2, 2], with b; the first value is sum([1,2,2]*3) = 15
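Again as a sanity check (my addition, with the same a and b): because b has no f axis, this is equivalent to summing over the features first and then scaling by b[h, z].

# same result: collapse the feature axis, then scale by b[h, z]
collapsed = tf.reduce_sum(a, axis=-1, keepdims=True) * b[tf.newaxis, :, :]
tf.debugging.assert_equal(tf.einsum('shf,hz->shz', a, b), collapsed)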
The above two resemble the transformer encoder architecture with a feature scaling layer, and the output structure (batch, H, F) is preserved: the batch and horizon axes stay intact, with the last axis having z units in the second case.
shf,hfyz->syz: This does both a between-features and a within-features combination; both h and f are contracted.
b = tf.random.uniform(minval=2, maxval=4, shape=(2,3,4,5), dtype=tf.int32)
tf.einsum('shf,hfyz->syz', a,b)
# each output element (i, j) is the dot product of a with b[:, :, i, j]
# first element is tf.reduce_sum(a*b[:,:,0,0])
Here, in the output (s, y, z), y doesn't correspond to the horizon and z doesn't correspond to the features; each value is a combination of values across both of them.
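For completeness, here is roughly how such an equation is wired into a layer. This is only a sketch with illustrative output_shape/bias_axes values; I'm assuming tf.keras.layers.EinsumDense can infer the kernel for this equation from the input shape and output_shape, as it does for the documented 'abc,cd->abd' case.

x = tf.random.uniform((1, 2, 3))  # float input: (batch=s, horizon=h, features=f)
# the kernel spec 'hfyz' -> shape (2, 3, 4, 5) would be inferred from the input shape and output_shape
layer = tf.keras.layers.EinsumDense('shf,hfyz->syz', output_shape=(4, 5), bias_axes='z')
print(layer(x).shape)  # (1, 4, 5)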