
I want to add new words to my BPE tokenizer. I know that the symbol `Ġ` marks the start of a new word (it stands for a preceding space), and the majority of tokens in the vocabs of pre-trained tokenizers start with `Ġ`. Assume I want to add the word Salah to my tokenizer. I tried to add both a `Salah` token and `ĠSalah`: `tokenizer.add_tokens(['Salah', 'ĠSalah'])` # they get ids 50265 and 50266 respectively. However, when I tokenize a sentence in which Salah appears, the tokenizer never returns the second id (neither with `.tokenize` nor with `.encode`). For instance, `tokenizer.tokenize('I love Salah and salad')` returns `['I', 'Ġlove', 'Salah', 'Ġand', 'Ġsalad']`. The question is: should I use the symbol `Ġ` when adding new tokens, or does the tokenizer add it itself? Or must it be specified manually? Thanks in advance!
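For context on what `Ġ` is: in the GPT-2/RoBERTa byte-level BPE convention, bytes outside the printable ranges — including the space byte `0x20` and everything below it — are remapped to visible Unicode code points starting at `0x100`, so a leading space shows up in the vocab as `Ġ` (U+0120). A minimal sketch of that mapping for the space byte (assuming that convention; this is not a call into any tokenizer library):

```python
# Sketch of the byte-to-unicode trick used by GPT-2/RoBERTa vocab files:
# bytes 0x00-0x20 are not printable, so they are remapped in order to the
# code points 0x100-0x120. The space byte 0x20 therefore lands on U+0120,
# which renders as the visible character 'Ġ'.
SHIFT = 0x100

def visible_space(byte: int = 0x20) -> str:
    """Map the raw space byte to the printable stand-in seen in the vocab."""
    return chr(byte + SHIFT)

print(visible_space())            # prints 'Ġ'
print(ord('Ġ') - SHIFT == 0x20)   # True: 'Ġ' is just the space byte, shifted
```

This is why a token spelled `ĠSalah` in the vocab means "Salah preceded by a space", not "Salah at the end of something".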

Akim
  • Are you aware that you have to pretrain the model if you want to add custom tokens? – dennlinger Jun 05 '20 at 17:43
  • @dennlinger sure, but to make the fine-tuning better I need to add some popular tokens from the target dataset. – Akim Jun 05 '20 at 18:09

0 Answers