I am trying to implement a BiLSTM-Attention-CRF model for the NER task. I can already perform NER with a BiLSTM-CRF model (code from here), but I need to add attention to improve the model's performance.
Right now my model is:

BiLSTM -> Linear layer (hidden to tag) -> CRF layer

The output from the linear layer has shape (seq. length x tagset size) and is then fed into the CRF layer.
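For reference, the current forward pass looks roughly like this (just a shape sketch with illustrative names and sizes, not my exact code):

import torch
import torch.nn as nn

# BiLSTM -> Linear (hidden_dim -> tagset_size) -> CRF
seq_len, embedding_dim, hidden_dim, tagset_size = 10, 100, 128, 5

embeds = torch.randn(seq_len, 1, embedding_dim)              # (seq_len, batch=1, emb_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim // 2, bidirectional=True)
lstm_out, _ = lstm(embeds)                                   # (seq_len, 1, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, tagset_size)
emissions = hidden2tag(lstm_out.view(seq_len, hidden_dim))   # (seq_len, tagset_size)
# `emissions` is what gets fed into the CRF layer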
I am trying to replace the linear layer with an attention layer using the code below:
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        # small MLP that assigns a single scalar energy to each timestep
        self.projection = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(True),
            nn.Linear(64, 1)
        )

    def forward(self, encoder_outputs):
        batch_size = encoder_outputs.size(0)
        # (B, L, H) -> (B, L, 1): one energy score per token
        energy = self.projection(encoder_outputs)
        # normalize the scores over the sequence dimension
        weights = F.softmax(energy.squeeze(-1), dim=1)
        # (B, L, H) * (B, L, 1) -> (B, H): weighted sum pools the whole sequence
        outputs = (encoder_outputs * weights.unsqueeze(-1)).sum(dim=1)
        return outputs, weights
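A quick shape check (a minimal sketch with made-up sizes) shows that this layer pools over the sequence dimension, so I get one vector per sentence rather than one per token:

import torch

attention = SelfAttention(hidden_dim=128)
encoder_outputs = torch.randn(1, 10, 128)        # (batch, seq_len, hidden_dim)
outputs, weights = attention(encoder_outputs)
print(outputs.shape)   # torch.Size([1, 128])  -- one vector per sentence
print(weights.shape)   # torch.Size([1, 10])   -- one weight per token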
While doing so, I have two issues:
- I cannot get the output to come out in the shape (seq. length x tagset size) so that it can be fed into the CRF layer.
- According to this paper, we need to initialize and learn a word-level context vector, which I cannot see in this implementation of the attention model (see my attempted sketch below).
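Is something like the following what the paper means by a learned word-level context vector? This is only my guess at the idea; the class name, the tanh projection, and the parameter names are mine, not taken from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextVectorAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.projection = nn.Linear(hidden_dim, hidden_dim)
        # word-level context vector: randomly initialized and learned during training
        self.context_vector = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, encoder_outputs):
        # (B, L, H) -> (B, L, H)
        u = torch.tanh(self.projection(encoder_outputs))
        # score each token against the context vector: (B, L, H) x (H,) -> (B, L)
        scores = u.matmul(self.context_vector)
        weights = F.softmax(scores, dim=1)
        # weighted sum over the sequence: (B, H)
        outputs = (encoder_outputs * weights.unsqueeze(-1)).sum(dim=1)
        return outputs, weights

But this still pools the sequence into a single vector, so it does not solve the shape problem above either.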
Kindly help me out.
TIA