How exactly is this positional encoding calculated?
Let's assume a machine translation scenario, and these are the input sentences:
english_text = ["this is good", "this is bad"]
german_text = ["das ist gut", "das ist schlecht"]
Now our input vocabulary size is 4 and the embedding dimension is 4.
#words #embeddings
this - [0.5, 0.2, 0.3, 0.1]
is - [0.1, 0.2, 0.5, 0.1]
good - [0.9, 0.7, 0.9, 0.1]
bad - [0.7, 0.3, 0.4, 0.1]
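For concreteness, here is a minimal sketch of this toy lookup (the vectors are just the made-up values above, not trained embeddings):

```python
import numpy as np

# toy vocabulary with hand-picked 4-dimensional embeddings (values from above)
embeddings = {
    "this": np.array([0.5, 0.2, 0.3, 0.1]),
    "is":   np.array([0.1, 0.2, 0.5, 0.1]),
    "good": np.array([0.9, 0.7, 0.9, 0.1]),
    "bad":  np.array([0.7, 0.3, 0.4, 0.1]),
}

# embed the first English sentence: shape (3, 4) = (sentence length, d_model)
sentence = "this is good".split()
word_embeddings = np.stack([embeddings[w] for w in sentence])
```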
As per the transformer paper, we add each word's positional encoding to its word embedding and then pass the result to the encoder, as shown in the image below.
As far as the paper is concerned, they give this formula for calculating the positional encoding of each word:
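For reference, the formula given in the paper is:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the word's position in the sentence and i indexes the sin/cos pairs of embedding dimensions.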
So, this is how I think I can implement it:
import numpy as np

d_model = 4              # embedding dimension
max_sentence_length = 3  # as per my examples above

positional_embeddings = np.zeros((max_sentence_length, d_model))

for position in range(max_sentence_length):
    for i in range(0, d_model, 2):
        positional_embeddings[position, i] = (
            np.sin(position / (10000 ** ((2 * i) / d_model)))
        )
        positional_embeddings[position, i + 1] = (
            np.cos(position / (10000 ** ((2 * (i + 1)) / d_model)))
        )
Then the new embeddings (here for the first sentence, "this is good") will be:
[[0.5, 0.2, 0.3, 0.1],
[0.1, 0.2, 0.5, 0.1],
[0.9, 0.7, 0.9, 0.1]] + positional_embeddings = NEW EMBEDDINGS
## shapes
3 x 4 + 3 x 4 = 3 x 4
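As a self-contained sketch of that addition (the positional encodings here use the same vectorized form as my earlier sketch):

```python
import numpy as np

d_model, max_len = 4, 3

word_embeddings = np.array([[0.5, 0.2, 0.3, 0.1],   # "this"
                            [0.1, 0.2, 0.5, 0.1],   # "is"
                            [0.9, 0.7, 0.9, 0.1]])  # "good"

# positional encodings for positions 0..2
pos = np.arange(max_len)[:, None]
rates = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
positional_embeddings = np.zeros((max_len, d_model))
positional_embeddings[:, 0::2] = np.sin(pos * rates)
positional_embeddings[:, 1::2] = np.cos(pos * rates)

new_embeddings = word_embeddings + positional_embeddings   # (3, 4) + (3, 4) -> (3, 4)
```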
Is this how the calculation is carried out in an actual implementation? Please correct me if there is any mistake in my pseudo-implementation above.
If everything is correct, then I have three doubts that I hope someone can clear up:
1) In the above implementation we use the sin formula for even indices and the cos formula for odd indices, but I couldn't understand the reason behind it. I read that it takes advantage of cyclic properties, but I couldn't understand that either.
2) Is there a reason behind choosing 10000^(2i/d_model) or 10000^(2(i+1)/d_model) as the scaling factor in the formula?
3) Not all sentences will be as long as the maximum sentence length, so we might have to pad them. Do we also calculate positional encodings for the padding tokens?