How truncation works when applying BERT tokenizer on the batch of sentence pairs in HuggingFace?

Question

Say, I have three sample sentences:

s0 = "This model was pretrained using a specific normalization pipeline available here!"
s1 = "Thank to all the people around,"
s2 = "Bengali Mask Language Model for Bengali Language"

I could make a batch like:

batch = [[s[0], s[1]], [s[1], s[2]]]

Now, if I apply the BERT tokenizer on the sentence pairs, it truncates the sentence pairs if the length exceeds in such a way that the ultimate sum of the sentence pairs' lengths meets the max_length parameter, which was supposed to be done, okay. Here's what I meant:

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForPreTraining.from_pretrained("bert-base-uncased")

encoded = tokenizer(batch, padding="max_length", truncation=True, max_length=10)["input_ids"]
decoded = tokenizer.batch_decode(encoded)
print(decoded)

>>>Output: ['[CLS] this model was pre [SEP] thank to all [SEP]', '[CLS] thank to all [SEP] bengali mask language model [SEP]']

My question is, how does the truncation work here in the pair of sentences where the number of tokens from each sentence of each pair is not equal?

For example, in the first example output '[CLS] this model was pre [SEP] thank to all [SEP]' number of tokens from the two sentences has not come equally i.e [CLS] 4 tokens [SEP] 3 tokens [SEP].

score 0 · Answer 1 · answered May 26 '22 at 20:31

There are different truncation strategies you can choose from:

True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).

How truncation works when applying BERT tokenizer on the batch of sentence pairs in HuggingFace?

1 Answers1