
I am trying to implement a BiLSTM-Attention-CRF model for the NER task. I am able to perform NER based on the BiLSTM-CRF model (code from here), but I need to add attention to improve the performance of the model.

Right now my model is:

BiLSTM -> Linear Layer (Hidden to tag) -> CRF Layer

The output from the linear layer is (seq. length x tagset size), and it is then fed into the CRF layer.
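
For clarity, here is a minimal, self-contained sketch of that pipeline up to the CRF (the dimensions and variable names are my own, just to make the shapes explicit):

import torch
import torch.nn as nn

seq_len, embedding_dim, hidden_dim, tagset_size = 11, 5, 4, 3

embeds = torch.randn(seq_len, 1, embedding_dim)             # (seq_len, batch=1, embedding_dim)
bilstm = nn.LSTM(embedding_dim, hidden_dim // 2, bidirectional=True)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

lstm_out, _ = bilstm(embeds)                                # (seq_len, 1, hidden_dim)
emissions = hidden2tag(lstm_out.view(seq_len, hidden_dim))  # (seq_len, tagset_size)
# `emissions` is what the CRF layer consumes for scoring/decoding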

I am trying to replace the linear layer with an attention layer using the code below:

import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        # small MLP that scores each hidden state: (B, L, H) -> (B, L, 1)
        self.projection = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(True),
            nn.Linear(64, 1)
        )

    def forward(self, encoder_outputs):
        # encoder_outputs: (B, L, H)
        energy = self.projection(encoder_outputs)          # (B, L, 1)
        # attention weights over the sequence dimension
        weights = F.softmax(energy.squeeze(-1), dim=1)     # (B, L)
        # weighted sum over the sequence: one vector per sequence
        outputs = (encoder_outputs * weights.unsqueeze(-1)).sum(dim=1)  # (B, H)
        return outputs, weights

While doing so I have two issues:

  • I cannot make the output come out in the shape (seq. length x tagset size) so that it can be fed into the CRF layer.
  • According to this paper, we need to initialize and learn a word-level context vector, which I cannot see in this implementation of the attention model (my attempt at what the paper describes is sketched below).
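
As far as I understand the paper, the word-level attention looks roughly like the sketch below (my own naming and reading of it, so I may be misinterpreting), and it still collapses the sequence into a single vector:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordContextAttention(nn.Module):
    """My reading of the paper: u_t = tanh(W h_t + b), a_t = softmax(u_t . u_w)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        # the learnable word-level context vector u_w the paper talks about
        self.context = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, encoder_outputs):                    # (B, L, H)
        u = torch.tanh(self.proj(encoder_outputs))         # (B, L, H)
        scores = u.matmul(self.context)                    # (B, L)
        weights = F.softmax(scores, dim=1)                 # (B, L)
        # the weighted sum still gives one vector per sequence, not per token
        pooled = (encoder_outputs * weights.unsqueeze(-1)).sum(dim=1)  # (B, H)
        return pooled, weights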

Kindly help me out.

TIA

abhi8569

1 Answer


What you implemented is a rather unusual type of self-attention. It resembles the original self-attention for sequence classification, which was probably a partial inspiration for the Attention Is All You Need paper.

In general, attention can be understood as a sort of probabilistic hidden-state retrieval: given some keys, you retrieve some values. In standard Bahdanau attention, the key is the decoder state and the values are the encoder states. In Transformer self-attention, the hidden states themselves are used as keys to retrieve information from the other states, i.e., every state is a key and a value at the same time. In the special case you have implemented, you only have one key, which is sort of encoded in the projection. You use this single constant key to retrieve a vector from the hidden states; as a result, you only get one vector per sequence.

What you probably want is Transformer-style self-attention, where each state is used as a key and gets a summary of the values. For that, you can use the nn.MultiheadAttention class in PyTorch. In addition to what I described, it does the attention in multiple heads, so it can do a more fine-grained retrieval. Note that in your case the queries, keys and values are the same tensor, i.e., the output of the Bi-LSTM.
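
For example, something along these lines (a rough sketch with dimensions of my own choosing, not a drop-in replacement for your code) keeps one output vector per token, so you still end up with a (seq. length x tagset size) emission matrix for the CRF:

import torch
import torch.nn as nn

seq_len, batch, hidden_dim, tagset_size, num_heads = 11, 1, 8, 5, 2

lstm_out = torch.randn(seq_len, batch, hidden_dim)   # BiLSTM output, (L, N, H)

self_attn = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

# queries, keys and values are all the same BiLSTM states
attn_out, attn_weights = self_attn(lstm_out, lstm_out, lstm_out)   # (L, N, H)

emissions = hidden2tag(attn_out.view(seq_len, hidden_dim))         # (L, tagset_size)
# `emissions` can be fed into the CRF layer exactly as before

Because the attention output has the same hidden size as its input, the existing hidden-to-tag linear layer and the CRF can stay untouched.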

Jindřich
  • I am new to attention and a bit confused: after putting the output of the LSTM into the MultiheadAttention class, what do we do with the output of the attention layer? – Nathan Chan Jun 06 '21 at 15:04