I just finished reading the Transformer paper and the BERT paper, but I couldn't figure out why the Transformer is described as uni-directional and BERT as bi-directional, as stated in the BERT paper. Since they don't use recurrent networks, it's not straightforward to interpret what the directions mean. Can anyone give me a clue? Thanks.
1 Answer
To clarify, the original Transformer model from Vaswani et al. is an encoder-decoder architecture, so the blanket statement "the Transformer is uni-directional" is misleading.
In fact, the Transformer encoder is bi-directional: self-attention at each position can attend to tokens both to its left and to its right. In contrast, the decoder is uni-directional, because it generates text one token at a time and must not attend to positions to the right of the current token (otherwise it would see the words it has yet to predict). The decoder enforces this by masking out the tokens to the right in the self-attention scores.
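To make the masking concrete, here is a minimal NumPy sketch (not taken from either paper; the function name `attention_weights` and the random Q/K matrices are purely illustrative). It contrasts the two cases: the encoder applies no mask, while the decoder sets the scores above the diagonal to -inf, so each position only attends to itself and to positions on its left.

```python
import numpy as np

def attention_weights(seq_len, causal=False):
    # Random query/key vectors stand in for real token representations.
    rng = np.random.default_rng(0)
    d = 8
    Q = rng.normal(size=(seq_len, d))
    K = rng.normal(size=(seq_len, d))
    scores = Q @ K.T / np.sqrt(d)  # (seq_len, seq_len) attention scores

    if causal:
        # Decoder-style mask: positions to the right of the current token
        # get -inf, so their softmax weight becomes exactly zero.
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)

    # Softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

print(np.round(attention_weights(4), 2))               # encoder: full matrix
print(np.round(attention_weights(4, causal=True), 2))  # decoder: lower-triangular
```

Running this, the unmasked version produces a dense weight matrix (every token attends to every other token), while the causal version is lower-triangular, which is exactly the "left-context-only" behaviour the BERT paper refers to.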
BERT uses the transformer encoder architecture and can therefore attend both to the left and right, resulting in "bi-directionality".
From the BERT paper itself:
We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.
Recommended reading: this article.

- Great interpretation! I also thought the Transformer encoder was bi-directional since it attends to both left and right tokens. Here, uni-directional and bi-directional mean something a bit different from the same concepts in RNNs. Your response makes it very clear. – JShen Mar 12 '19 at 21:36
- In that sense, is the decoder bidirectional as well? Basically you feed the already predicted words into the decoder, and these words can attend in both directions. – Hanhan Li Nov 23 '20 at 00:54