5

I am developing a language model like the one in https://pytorch.org/tutorials/beginner/transformer_tutorial.html.

It is not clear to me whether positional encoding is necessary here. As far as I understand, it is necessary for the translation task because the decoder has to be able to locate a word from the previous output within the sequence coming from the encoder. But is it necessary for language modeling without the decoder?

Is it possible that the words in the encoder output are shuffled?

Edit:

There is no explanation of this in the original paper, and I didn't find one in tutorials either (e.g. https://kazemnejad.com/blog/transformer_architecture_positional_encoding/).

I don't understand this:

"As each word in a sentence simultaneously flows through the Transformer’s encoder/decoder stack, The model itself doesn’t have any sense of position/order for each word."

From my point of view, the transformer encoder has information about the order, because its input is an ordered sequence (similar to an RNN).

I tried removing the positional encoding from the model. It works, but with worse performance.

Is it useful to add such a positional encoding to an RNN? Could it improve its performance?

Andrey

3 Answers

9

This research group claims positional encoding is not necessary: https://arxiv.org/abs/1905.04226

Yaroslav Bulatov
0

I saw the following video: https://www.youtube.com/watch?v=S27pHKBEp30. At around the 16:00 timestamp, he says that without positional encoding the attention mechanism is just a 'Bag of Words'.

Bhavin
-1

Taken from https://jalammar.github.io/illustrated-transformer/

Unmasked self-attention is invariant

Changing the order of the words permutes the rows of V, but it also permutes the columns of the correlation matrix Q × transpose(K). The two permutations cancel, so the output for each word is unchanged (the output rows are merely reordered) and positional information is lost after the first self-attention layer.
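
A quick numerical check of this (a sketch, assuming a recent PyTorch where `nn.MultiheadAttention` supports `batch_first`): reversing the word order only reorders the output rows, so each word's representation does not depend on where it sits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model, n_heads = 5, 16, 4

# Plain (unmasked) self-attention, eval mode so dropout is off
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True).eval()

x = torch.randn(1, seq_len, d_model)          # one "sentence" of 5 token embeddings
perm = torch.arange(seq_len - 1, -1, -1)      # reverse the word order

out, _ = attn(x, x, x, need_weights=False)
out_shuffled, _ = attn(x[:, perm], x[:, perm], x[:, perm], need_weights=False)

# Each word gets the same representation, just sitting in a different row:
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))  # True
```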

To solve this, you encode the position into the embedding of each word, so the network can take two embeddings and learn how far apart they are, no matter the order in which they are fed in.
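
For concreteness, a minimal sketch of the sinusoidal encoding from "Attention Is All You Need" (the same scheme the linked PyTorch tutorial uses), simply added to the word embeddings before the first attention layer:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """One row per position: sine on even dimensions, cosine on odd dimensions."""
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                  # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The encoding is added to the (order-agnostic) word embeddings, so each word
# carries a fingerprint of its position into the attention layers.
embeddings = torch.randn(10, 512)        # 10 tokens, d_model = 512
inputs = embeddings + sinusoidal_positional_encoding(10, 512)
```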

From the abstract of the paper that claimed positional encoding is not necessary:

The positional encoding is an essential augmentation for the self-attention mechanism which is invariant to sequence ordering.

By this they mean that unmasked self-attention is invariant.

Masked self-attention is not invariant

However, masking self-attention changes this, as it means that

information increases along the positional dimension which is a positional signal by its own
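
Repeating the check from the unmasked case, but with a causal mask (a sketch using the same `nn.MultiheadAttention` setup), shows the difference: with masking, reordering the words changes each word's representation, so the order leaves a trace even without positional encoding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model, n_heads = 5, 16, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True).eval()

x = torch.randn(1, seq_len, d_model)
perm = torch.arange(seq_len - 1, -1, -1)      # reverse the word order

# Causal mask: True above the diagonal = "may not attend to later positions"
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

out, _ = attn(x, x, x, attn_mask=causal, need_weights=False)
out_shuffled, _ = attn(x[:, perm], x[:, perm], x[:, perm],
                       attn_mask=causal, need_weights=False)

# Unlike the unmasked case, the reordered outputs no longer match:
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))  # False
```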

Tom Huntington
  • As in, for the *specific* self-attention mechanism which is invariant to sequence ordering. By this I believe they mean the encoder transformer (like BERT) but *not* the decoder transformer (like GPT), which is what the paper is about. – Denziloe Jun 12 '23 at 02:32