I am using Transformers' RobBERT (the Dutch counterpart of RoBERTa) for sequence classification, fine-tuned for sentiment analysis on the Dutch Book Reviews dataset.
I wanted to test how well it works on a similar dataset (also for sentiment analysis), so I annotated a set of text fragments and checked the accuracy. When I looked at which sentences were misclassified, I noticed that the output for a single sentence depends heavily on the padding length I use when tokenizing. See the code below.
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
import torch.nn.functional as F

# Load the RobBERT model fine-tuned on Dutch book reviews
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robBERT-dutch-books", num_labels=2)
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-dutch-books", do_lower_case=True)

sent = 'De samenwerking gaat de laatste tijd beter'
max_seq_len = 64

# Tokenize and pad the sequence up to max_seq_len
test_token = tokenizer(sent,
                       max_length=max_seq_len,
                       padding='max_length',
                       truncation=True,
                       return_tensors='pt')

out = model(test_token['input_ids'], test_token['attention_mask'])
probs = F.softmax(out[0], dim=1).detach().numpy()
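(As a quick inspection, not part of my original pipeline, I also printed the encoding; with padding='max_length' the real tokens get attention mask 1 and the padding positions get 0, as expected.)

# Quick inspection of the padded encoding produced above
print(test_token['input_ids'])
print(test_token['attention_mask'])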
For this sample text, which translates to English as "The collaboration has been improving lately", the classification output differs hugely depending on max_seq_len. For max_seq_len = 64, probs is:
[[0.99149346 0.00850648]]
whereas for max_seq_len = 9, which is the actual sequence length including the special tokens, it is:
[[0.00494814 0.9950519 ]]
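A minimal sketch of that comparison (reusing model, tokenizer and sent from the snippet above; the exact numbers will of course depend on your setup):

# Run the same sentence with two different padding lengths and compare
for max_seq_len in (9, 64):
    tokens = tokenizer(sent,
                       max_length=max_seq_len,
                       padding='max_length',
                       truncation=True,
                       return_tensors='pt')
    with torch.no_grad():
        logits = model(tokens['input_ids'], attention_mask=tokens['attention_mask'])[0]
    print(max_seq_len, F.softmax(logits, dim=1).numpy())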
Can anyone explain why this huge difference in classification happens? I would have thought the attention mask ensures that padding to the max sequence length makes no difference to the output.
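To make that expectation concrete, this is the sanity check I would expect to pass (again reusing model, tokenizer and sent from above; the atol tolerance is an arbitrary choice on my part):

# Sanity check: with the attention mask passed in, a padded and an unpadded
# encoding of the same sentence should give (almost) identical logits.
with torch.no_grad():
    unpadded = tokenizer(sent, return_tensors='pt')
    padded = tokenizer(sent, max_length=64, padding='max_length',
                       truncation=True, return_tensors='pt')
    logits_unpadded = model(**unpadded)[0]
    logits_padded = model(**padded)[0]
print(torch.allclose(logits_unpadded, logits_padded, atol=1e-4))

Given the probabilities above, this presumably prints False for me, which is exactly what I don't understand.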