Adding tokens to GPT-2 BPE tokenizer

Asked Jun 05 '20 at 15:56

Active Nov 29 '20 at 12:05

Viewed 1,061 times

I want to add new words to my BPE tokenizer. As I understand it, the symbol Ġ marks the start of a new word (it stands for the space before the token), and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ. Assume I want to add the word Salah to my tokenizer. I tried adding both the Salah token and ĠSalah:

tokenizer.add_tokens(['Salah', 'ĠSalah']) # they get the ids 50265 and 50266 respectively

However, when I tokenize a sentence in which Salah appears, the tokenizer never returns the second id (neither with .tokenize nor with .encode). For instance, tokenizer.tokenize('I love Salah and salad') returns ['I', 'Ġlove', 'Salah', 'Ġand', 'Ġsalad'].

The question is: should I include the symbol Ġ when adding new tokens, or does the tokenizer add it itself? Or must it be specified manually? Thanks in advance!
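For context on what Ġ stands for, here is a minimal sketch, assuming the byte-to-unicode table used by GPT-2's reference byte-level BPE encoder. The helper name to_byte_level is hypothetical and only for illustration; the sketch shows the character mapping, not how add_tokens matches text.

```python
def bytes_to_unicode():
    """Map every byte 0..255 to a printable unicode character.

    Printable bytes keep their own character; the rest (control
    characters, space, etc.) are shifted to code points from 256 upward.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def to_byte_level(text):
    # Hypothetical helper: render text the way byte-level BPE vocab entries look.
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


print(to_byte_level(" Salah"))  # the leading space becomes Ġ (U+0120)
print(to_byte_level("Salah"))   # no leading space, so no Ġ
```

Under this mapping the space byte (32) becomes U+0120, which is why a vocabulary entry such as Ġlove corresponds to " love" in the raw text.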

edited Nov 29 '20 at 12:05

Guy Coder

asked Jun 05 '20 at 15:56

Akim

Are you aware that you have to pretrain the model if you want to add custom tokens? – dennlinger Jun 05 '20 at 17:43
@dennlinger sure, but to make the fine-tuning better I need to add some popular tokens of the target dataset. – Akim Jun 05 '20 at 18:09

0 Answers