I am doing a sequence classification task using nn.TransformerEncoder(), whose pipeline is similar to nn.LSTM().
I have tried several temporal feature fusion methods:
1. Selecting the final output as the representation of the whole sequence.
2. Using an affine transformation to fuse these features.
3. Classifying the sequence frame by frame, and then selecting the max values as the category of the whole sequence.
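To make the three fusion methods concrete, here is a minimal sketch of each (the shapes, names, and dimensions are my own assumptions, with encoder output of shape T*N*D):

```python
import torch

T, N, D, C = 10, 4, 16, 4            # seq len, batch, feature dim, num classes
out = torch.randn(T, N, D)           # encoder output, shape T*N*D
fc = torch.nn.Linear(D, C)           # shared classification head

# 1) Last time step as the sequence representation
rep_last = out[-1]                   # N*D
logits_last = fc(rep_last)           # N*C

# 2) Affine fusion over time: flatten all T steps, project back to D
fuse = torch.nn.Linear(T * D, D)
rep_affine = fuse(out.permute(1, 0, 2).reshape(N, T * D))  # N*D
logits_affine = fc(rep_affine)       # N*C

# 3) Frame-by-frame classification, then max over time
frame_logits = fc(out)               # T*N*C
logits_max = frame_logits.max(dim=0).values  # N*C
```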
But all three of these methods give terrible accuracy, only 25% on a 4-category classification task, while using nn.LSTM with the last hidden state I can easily achieve 83% accuracy. I have tried plenty of hyperparameters for nn.TransformerEncoder(), but none of them improved the accuracy.
I have no idea how to adjust this model now. Could you give me some practical advice? Thanks.
For the LSTM, the forward() is:
def forward(self, x_in, x_lengths, apply_softmax=False):
    # Embed
    x_in = self.embeddings(x_in)
    # Feed into the LSTM; nn.LSTM returns (output, (h_n, c_n)).
    # Note: x_lengths is currently unused.
    out, (h_n, c_n) = self.LSTM(x_in)  # shape of out: T*N*D
    # Gather the last hidden state
    out = out[-1, :, :]  # N*D
    # FC layers
    z = self.dropout(out)
    z = self.fc1(z)
    z = self.dropout(z)
    y_pred = self.fc2(z)
    if apply_softmax:
        y_pred = F.softmax(y_pred, dim=1)
    return y_pred
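One thing worth double-checking in this forward(): out[-1] only picks the true last hidden state if every sequence in the batch has length T. With padded batches, the last relevant state has to be gathered per sequence using x_lengths. A sketch of that gather (tensor names and shapes are assumptions):

```python
import torch

def gather_last_relevant(out, x_lengths):
    """Pick each sequence's last valid output from a padded batch.

    out:       T*N*D padded encoder/RNN outputs
    x_lengths: N true sequence lengths
    out[-1] would read the padded tail of shorter sequences;
    indexing each sequence at its own last valid step avoids that.
    """
    T, N, D = out.shape
    idx = (x_lengths - 1).clamp(min=0)   # N, last valid index per sequence
    return out[idx, torch.arange(N)]     # N*D

out = torch.randn(5, 3, 8)
lengths = torch.tensor([5, 3, 1])
rep = gather_last_relevant(out, lengths)  # rep[1] == out[2, 1]
```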
For the Transformer:
def forward(self, x_in, x_lengths, apply_softmax=False):
    # Embed
    x_in = self.embeddings(x_in)
    # Feed into the Transformer encoder (x_lengths is currently unused)
    out = self.transformer(x_in)  # shape of out: T*N*D
    # Gather the last hidden state
    out = out[-1, :, :]  # N*D
    # FC layers
    z = self.dropout(out)
    z = self.fc1(z)
    z = self.dropout(z)
    y_pred = self.fc2(z)
    if apply_softmax:
        y_pred = F.softmax(y_pred, dim=1)
    return y_pred
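Since x_lengths is never passed to self.transformer above, padded positions are attended to like real tokens. nn.TransformerEncoder accepts a src_key_padding_mask for this; a hedged sketch of masking plus mean pooling over valid steps only (all shapes, names, and hyperparameters here are my own assumptions, not from the model above):

```python
import torch
import torch.nn as nn

T, N, D = 12, 4, 32                     # seq len, batch, model dim (assumed)
x = torch.randn(T, N, D)                # already-embedded input, T*N*D
x_lengths = torch.tensor([12, 9, 5, 2]) # true lengths per sequence

layer = nn.TransformerEncoderLayer(d_model=D, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# True where a position is padding; src_key_padding_mask expects N*T
pad_mask = torch.arange(T)[None, :] >= x_lengths[:, None]  # N*T
out = encoder(x, src_key_padding_mask=pad_mask)            # T*N*D

# Mean-pool over the valid time steps only
valid = (~pad_mask).T.unsqueeze(-1).float()                  # T*N*1
rep = (out * valid).sum(dim=0) / x_lengths[:, None].float()  # N*D
```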