
I'm implementing the self-attention part of a transformer encoder using PyTorch's nn.MultiheadAttention, and I'm confused about padding masking in the transformer.

The following picture shows the self-attention weights, with queries as rows and keys as columns.

As you can see, there are some "<PAD>" tokens, and I have already masked them on the key side, so no attention weight is computed for those key positions.

[image: self-attention weight matrix, with the <PAD> key columns already masked]

There are still two questions:

  1. On the query side, can I also mask the "<PAD>" tokens (everything except the red square part)? Is this reasonable?

  2. How can I mask the "<PAD>" tokens in the query?

The attention weights are computed with a softmax along each row, and the mask is supplied through the src_mask or src_key_padding_mask argument. If I set an entire "<PAD>" row to -inf, the softmax returns nan and the loss becomes nan.
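
Here is a minimal sketch of what I mean, using a toy 5-token batch (the shapes and the batch_first layout are just for illustration): masking only the keys via key_padding_mask works, but putting -inf on whole query rows via attn_mask gives nan rows.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model, n_heads = 1, 5, 8, 2

mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
x = torch.randn(batch, seq_len, d_model)

# Pretend the last two positions are <PAD>.
key_padding_mask = torch.tensor([[False, False, False, True, True]])

# Masking only the keys is fine: every query row still has at least
# one unmasked key, so the row-wise softmax stays well defined.
_, w = mha(x, x, x, key_padding_mask=key_padding_mask)
print(w)  # <PAD> columns are 0, no nan

# Masking whole query rows with -inf breaks the softmax: a row with
# no finite entry normalises to 0/0.
attn_mask = torch.zeros(seq_len, seq_len)
attn_mask[3:, :] = float("-inf")
_, w = mha(x, x, x, attn_mask=attn_mask)
print(w)  # rows 3 and 4 come out as nan
```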

1 Answer


There is no need to mask the queries during self-attention. It is enough not to use the states corresponding to the <PAD> tokens later in the network (either as hidden states or as keys/values); then they will not influence the loss function or anything else in the network.
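
A minimal sketch of what I mean (the PAD_ID, vocabulary size, toy tokens/targets and the classifier head are made up for illustration): the <PAD> positions are masked only as keys, and the corresponding outputs are simply dropped from the loss with ignore_index, so no gradient ever flows back through them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_ID = 0                      # hypothetical padding token id
d_model, n_heads, vocab = 8, 2, 100

embed = nn.Embedding(vocab, d_model, padding_idx=PAD_ID)
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
classifier = nn.Linear(d_model, vocab)

tokens  = torch.tensor([[5, 17, 3, PAD_ID, PAD_ID]])   # toy batch
targets = torch.tensor([[6, 18, 4, PAD_ID, PAD_ID]])
pad_mask = tokens.eq(PAD_ID)    # True where <PAD>

x = embed(tokens)
# Mask only the keys: <PAD> positions never contribute as keys/values,
# but every query row (including <PAD> queries) still has a valid softmax.
h, _ = mha(x, x, x, key_padding_mask=pad_mask)

logits = classifier(h)
# The <PAD> queries do produce hidden states, but they never reach the
# loss: ignore_index drops them, so no gradient flows back through them.
loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1),
                       ignore_index=PAD_ID)
```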

If you want to make sure you did not introduce a bug that lets gradients flow through the <PAD> tokens, you can explicitly zero out the self-attention weights using torch.where after they are computed.
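
For example, something like the following hand-rolled scaled dot-product attention (nn.MultiheadAttention does not expose this hook, so this assumes you compute the weights yourself; the function name and shapes are illustrative): the <PAD> query rows are zeroed after the softmax instead of being set to -inf before it, which avoids the nan.

```python
import torch

def attention_with_zeroed_pads(q, k, v, pad_mask):
    """Scaled dot-product attention that zeroes <PAD> query rows
    after the softmax, so there is no nan and no gradient through <PAD>.

    q, k, v:  (batch, seq, d)
    pad_mask: (batch, seq) bool, True where the token is <PAD>
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, L, L)

    # Keys: the usual -inf masking before the softmax.
    scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
    weights = scores.softmax(dim=-1)                        # (B, L, L)

    # Queries: zero whole <PAD> rows *after* the softmax, instead of
    # setting them to -inf before it (which would give nan).
    weights = torch.where(pad_mask.unsqueeze(2),
                          torch.zeros_like(weights), weights)
    return weights @ v
```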

  • Sorry, I can't understand why there is no need to mask the queries during self-attention. I got bad performance on my task, so I'm trying to figure out where the problem is. – Ian Dec 15 '20 at 17:44
  • About the gradient flow: I'm quite sure that if I set row 6 through row 13 entirely to `-inf` it will return nan, because the attention weights pass through the softmax function along each row; see the PyTorch source code https://github.com/pytorch/pytorch/blob/778006918c31c3fa0ca3794575a65c1f854f861b/torch/nn/functional.py#L4297 – Ian Dec 15 '20 at 17:58
  • You can do the `torch.where` _after_ the softmax and set the weights to zero. – Jindřich Dec 16 '20 at 08:41