I want to add new words to my BPE tokenizer. As I understand it, the symbol Ġ marks the start of a new word (it encodes the preceding space), and the majority of tokens in the vocabularies of pre-trained tokenizers start with Ġ. Suppose I want to add the word Salah to my tokenizer. I tried adding both the Salah token and ĠSalah:
tokenizer.add_tokens(['Salah', 'ĠSalah']) # they get 50265 and 50266 values respectively.
However, when I tokenize a sentence in which Salah appears, the tokenizer never returns the second ID (neither with .tokenize nor with .encode), for instance:
tokenizer.tokenize('I love Salah and salad')
returns ['I', 'Ġlove', 'Salah', 'Ġand', 'Ġsalad'].
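For context, here is a minimal sketch of where the Ġ in tokens such as 'Ġlove' comes from, assuming the tokenizer uses GPT-2-style byte-level BPE (the IDs starting at 50265 suggest a RoBERTa-like vocabulary, but that is an assumption). The function bytes_to_unicode mirrors the helper published with GPT-2; to_byte_level is a hypothetical name introduced only for this illustration.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    following the scheme used by GPT-2-style byte-level BPE."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, ...) is shifted
    # to a code point at 256 and above so that it becomes visible.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Hypothetical helper: render text the way a byte-level vocabulary stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level(" Salah"))               # ĠSalah
print(to_byte_level("Salah"))                # Salah
print(hex(ord(BYTE_TO_CHAR[ord(" ")])))      # 0x120, i.e. the character Ġ
```

Under this scheme Ġ is simply the visible stand-in for a leading space, so 'ĠSalah' in the vocabulary corresponds to ' Salah' in raw text. If added tokens are matched against the raw input string before the byte-level mapping is applied (an assumption about the library version in use), a literal 'ĠSalah' would never occur in ordinary text, which would be consistent with ID 50266 never being returned.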
The question is: should I include the symbol Ġ when adding new tokens, or does the tokenizer add it itself? Or does it have to be specified manually?
Thanks in advance!
Are you aware that you have to pretrain the model if you want to add custom tokens? – dennlinger Jun 05 '20 at 17:43
@dennlinger sure, but to make the fine-tuning better I need to add some popular tokens of the target dataset. – Akim Jun 05 '20 at 18:09