I am using Transformers' RobBERT (the Dutch counterpart of RoBERTa) for sequence classification, fine-tuned for sentiment analysis on the Dutch Book Reviews dataset.
I wanted to test how well it works on a similar dataset (also for sentiment analysis), so I annotated a set of text fragments and checked the accuracy. When I looked at which sentences were misclassified, I noticed that the output for a single sentence depends heavily on the padding length I use when tokenizing. See the code below.
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
import torch.nn.functional as F

# Load the RobBERT model fine-tuned on Dutch book reviews
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robBERT-dutch-books", num_labels=2)
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-dutch-books", do_lower_case=True)

sent = 'De samenwerking gaat de laatste tijd beter'
max_seq_len = 64

# Tokenize and pad the sequence up to max_seq_len
test_token = tokenizer(sent,
                       max_length=max_seq_len,
                       padding='max_length',
                       truncation=True,
                       return_tensors='pt')

out = model(test_token['input_ids'], test_token['attention_mask'])
probs = F.softmax(out[0], dim=1).detach().numpy()
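(As a quick inspection, not part of my original pipeline, I also printed the encoding; with padding='max_length' the real tokens get attention mask 1 and the padding positions get 0, as expected.)

# Quick inspection of the padded encoding produced above
print(test_token['input_ids'])
print(test_token['attention_mask'])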
For this sample text, which translates to English as "The collaboration has been improving lately", the classification output differs hugely depending on max_seq_len. For max_seq_len = 64, probs is:
[[0.99149346 0.00850648]]
whereas for max_seq_len = 9, which is the actual sequence length including the special tokens, it is:
[[0.00494814 0.9950519 ]]
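A minimal sketch of that comparison (reusing model, tokenizer and sent from the snippet above; the exact numbers will of course depend on your setup):

# Run the same sentence with two different padding lengths and compare
for max_seq_len in (9, 64):
    tokens = tokenizer(sent,
                       max_length=max_seq_len,
                       padding='max_length',
                       truncation=True,
                       return_tensors='pt')
    with torch.no_grad():
        logits = model(tokens['input_ids'], attention_mask=tokens['attention_mask'])[0]
    print(max_seq_len, F.softmax(logits, dim=1).numpy())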
Can anyone explain why this huge difference in classification happens? I would have thought the attention mask ensures that padding to the max sequence length makes no difference to the output.
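To make that expectation concrete, this is the sanity check I would expect to pass (again reusing model, tokenizer and sent from above; the atol tolerance is an arbitrary choice on my part):

# Sanity check: with the attention mask passed in, a padded and an unpadded
# encoding of the same sentence should give (almost) identical logits.
with torch.no_grad():
    unpadded = tokenizer(sent, return_tensors='pt')
    padded = tokenizer(sent, max_length=64, padding='max_length',
                       truncation=True, return_tensors='pt')
    logits_unpadded = model(**unpadded)[0]
    logits_padded = model(**padded)[0]
print(torch.allclose(logits_unpadded, logits_padded, atol=1e-4))

Given the probabilities above, this presumably prints False for me, which is exactly what I don't understand.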