6

I tried the following tokenization example:

tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True)
sent = "I hate this. Not that."
_tokenized = tokenizer(sent, padding=True, max_length=20, truncation=True)
print(tokenizer.decode(_tokenized['input_ids']))
print(len(_tokenized['input_ids']))

The output was:

[CLS] i hate this. not that. [SEP]
9

Notice the parameter passed to the tokenizer: max_length=20. How can I make the BERT tokenizer append 11 [PAD] tokens to this sentence so the total length is 20?

MsA

1 Answer

8

Set padding="max_length" instead of padding=True:

_tokenized = tokenizer(sent, padding="max_length", max_length=20, truncation=True)
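Under the hood, padding="max_length" simply appends the tokenizer's pad token id (0 for BERT vocabularies) until the sequence reaches max_length, truncating first if it is too long. A minimal sketch of that behavior, with hypothetical token ids standing in for a real encoded sentence:

```python
def pad_to_max_length(input_ids, max_length=20, pad_token_id=0):
    # Truncate to max_length, then append pad ids until the target length is reached.
    ids = input_ids[:max_length]
    return ids + [pad_token_id] * (max_length - len(ids))

# Hypothetical input_ids for a short sentence: 101 = [CLS], 102 = [SEP] in BERT's vocab.
ids = [101, 7592, 2088, 102]
padded = pad_to_max_length(ids)
print(len(padded))   # 20
print(padded[4:])    # trailing zeros, i.e. [PAD] token ids
```

The real tokenizer also extends the attention_mask with zeros over the padded positions, so the model ignores them.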