I just finished reading the Transformer paper and the BERT paper, but I couldn't figure out why the Transformer is described as uni-directional and BERT as bi-directional, as stated in the BERT paper. Since they don't use recurrent networks, it's not straightforward to interpret what the directions mean. Can anyone give me a clue? Thanks.
1 Answer
To clarify, the original Transformer model from Vaswani et al. is an encoder-decoder architecture, so the blanket statement "the Transformer is uni-directional" is misleading.
In fact, the Transformer encoder is bi-directional: self-attention at each position can attend to tokens both to its left and to its right. In contrast, the decoder is uni-directional, because it generates text one token at a time and must not attend to positions to the right of the current token (otherwise it would see the words it has yet to predict). The decoder enforces this by masking out the tokens to the right in the self-attention scores.
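To make the masking concrete, here is a minimal NumPy sketch (not taken from either paper; the function name `attention_weights` and the random Q/K matrices are purely illustrative). It contrasts the two cases: the encoder applies no mask, while the decoder sets the scores above the diagonal to -inf, so each position only attends to itself and to positions on its left.

```python
import numpy as np

def attention_weights(seq_len, causal=False):
    # Random query/key vectors stand in for real token representations.
    rng = np.random.default_rng(0)
    d = 8
    Q = rng.normal(size=(seq_len, d))
    K = rng.normal(size=(seq_len, d))
    scores = Q @ K.T / np.sqrt(d)  # (seq_len, seq_len) attention scores

    if causal:
        # Decoder-style mask: positions to the right of the current token
        # get -inf, so their softmax weight becomes exactly zero.
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)

    # Softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

print(np.round(attention_weights(4), 2))               # encoder: full matrix
print(np.round(attention_weights(4, causal=True), 2))  # decoder: lower-triangular
```

Running this, the unmasked version produces a dense weight matrix (every token attends to every other token), while the causal version is lower-triangular, which is exactly the "left-context-only" behaviour the BERT paper refers to.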
BERT uses the transformer encoder architecture and can therefore attend both to the left and right, resulting in "bi-directionality".
From the BERT paper itself:
We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
Recommended reading: this article.

- Great interpretation! I also thought the Transformer encoder was bi-directional since it attends to both left and right tokens. Here, uni-directional and bi-directional mean something a bit different from the same concepts in RNNs. Your response makes it very clear. – JShen Mar 12 '19 at 21:36
- In that sense, is the decoder bidirectional as well? Basically you feed the already predicted words into the decoder, and these words can attend in both directions. – Hanhan Li Nov 23 '20 at 00:54