I have some text which I want to perform NLP on. To do so, I download a pre-trained tokenizer like so:
import transformers as ts
pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp')
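Just so the numbers below make sense, this is how I peek at what the pretrained tokenizer contains (standard transformers calls, nothing custom):
print(pr_tokenizer.vocab_size)                     # 30522 for distilbert-base-uncased
print(list(pr_tokenizer.get_vocab().items())[:3])  # a few (token, id) pairs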
Then I create my own tokenizer with my data like this:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(['transcripts.raw'], trainer)
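To make the next part concrete, here is how I look at what my tokenizer learned (get_vocab() and encode() from the tokenizers library; transcripts.raw is my own data and the string is just an arbitrary sample):
print(len(tokenizer.get_vocab()))                    # number of tokens learned from transcripts.raw
print(tokenizer.encode("some example text").tokens)  # quick sanity check that it tokenizes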
Now comes the part where I get confused... I need to update the entries in the pretrained tokenizer (pr_tokenizer) whose keys also exist in my tokenizer (tokenizer). I have tried several methods; here is one of them:
new_vocab = pr_tokenizer.vocab    # token -> id mapping from the pretrained tokenizer
v = tokenizer.get_vocab()         # token -> id mapping from my own tokenizer
for i in v:
    if i in new_vocab:            # token exists in both vocabularies
        new_vocab[i] = v[i]       # take the id from my tokenizer
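For reference, this is the quick check I use to see how big the overlap actually is (plain set operations on the two vocabs):
shared = set(v) & set(pr_tokenizer.get_vocab())   # tokens that exist in both vocabularies
print(len(shared), "tokens appear in both vocabs")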
So what do I do now? I was thinking something like:
pr_tokenizer.vocab.update(new_vocab)
or
pr_tokenizer.vocab = new_vocab
Neither works. Does anyone know a good way of doing this?
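For completeness, here is the check that convinced me nothing actually sticks (I'm on the fast tokenizer that AutoTokenizer returns by default, where .vocab appears to be rebuilt on every access):
pr_tokenizer.vocab.update(new_vocab)     # runs without error...
print(pr_tokenizer.vocab == new_vocab)   # ...but prints False for me, so the tokenizer is unchanged
# pr_tokenizer.vocab = new_vocab         # and plain assignment raises AttributeError for me -- vocab looks read-only here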