
I have some text which I want to perform NLP on. To do so, I download a pre-trained tokenizer like so:

import transformers as ts

pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp')

Then I create my own tokenizer with my data like this:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(['transcripts.raw'], trainer)

Now comes the part where I get confused: I need to update the entries in the pretrained tokenizer (pr_tokenizer) whose keys are the same as in my tokenizer (tokenizer). I have tried several methods; here is one of them:

new_vocab = pr_tokenizer.vocab
v = tokenizer.get_vocab()

for i in v:
    if i in new_vocab:
        new_vocab[i] = v[i]

So what do I do now? I was thinking something like:

pr_tokenizer.vocab.update(new_vocab)

or

pr_tokenizer.vocab = new_vocab

Neither works. Does anyone know a good way of doing this?


2 Answers


To do that, you can download the tokenizer files from GitHub or the HuggingFace website into the same folder as your code, and then edit the vocabulary file before the tokenizer is loaded.
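One way to get the files is git clone https://huggingface.co/distilbert-base-uncased; a Python alternative (a sketch, assuming the huggingface_hub package is installed, not part of the original answer) is:

# Sketch: fetch the distilbert-base-uncased files into a local folder.
# Assumes a recent huggingface_hub (pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="distilbert-base-uncased", local_dir="./distilbert-base-uncased")

With the files on disk, edit the vocabulary and reload: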

new_vocab = {}

# Read the pretrained vocabulary: one token per line, the line number is the token id
with open('./distilbert-base-uncased/vocab.txt', 'r') as f:
    for i, row in enumerate(f):
        new_vocab[row[:-1]] = i

# your vocabulary entries
v = tokenizer.get_vocab()

# replace the ids of tokens that appear in both vocabularies (your code)
for i in v:
    if i in new_vocab:
        new_vocab[i] = v[i]

# overwrite vocab.txt so that from_pretrained below picks up the edited vocabulary
with open('./distilbert-base-uncased/vocab.txt', 'w') as f:
    # invert the mapping so tokens can be looked up by id
    rev_vocab = {idx: token for token, idx in new_vocab.items()}
    # write the tokens back, one per line, in id order (skipping any gaps)
    for i in range(len(rev_vocab)):
        if i not in rev_vocab:
            continue
        f.write(rev_vocab[i] + '\n')

# loading the new tokenizer
pr_tokenizer = ts.AutoTokenizer.from_pretrained('./distilbert-base-uncased')
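As a quick sanity check (a sketch; the example sentence is arbitrary and not from the original answer), you can confirm the reloaded tokenizer picks up the edited vocabulary:

# Sketch: verify that the reloaded tokenizer uses the edited vocab.txt.
print(pr_tokenizer.tokenize("an example sentence from the transcripts"))
print(pr_tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", "[UNK]"]))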
  • I don't get why you two need to do this; aren't you going to use the pretrained model? Why isn't AutoTokenizer.from_pretrained('./distilbert-base-uncased') enough? – bitbang Nov 02 '21 at 14:22
  • @bitbang I wanted to "fine-tune" the tokenizer, and this seems to be the only way that I have found on the internet. – user9102437 Nov 03 '21 at 10:03
  • I believe tokens are just integers that NLP models consume and convert to vectors with their own understanding – bitbang Nov 03 '21 at 14:43
  • @user9102437 FYI, instead of editing an existing tokenizer, you can also train a new one (see the sketch after these comments) – SilentCloud Nov 03 '21 at 15:58
  • @SilentCloud Sure, but it probably won't be as good as one trained on big data with powerful machines (if the data is similar, that is). – user9102437 Nov 03 '21 at 16:46
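
Following up on SilentCloud's comment, a minimal sketch of training a brand-new tokenizer that keeps the pretrained tokenizer's configuration (this assumes pr_tokenizer is a "fast" tokenizer, which AutoTokenizer returns by default, and reuses the transcripts file from the question):

# Sketch: train a new tokenizer with the same pipeline/settings as the pretrained one.
def line_iterator(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line

new_tokenizer = pr_tokenizer.train_new_from_iterator(
    line_iterator("transcripts.raw"),    # corpus from the question
    vocab_size=pr_tokenizer.vocab_size,  # keep the original vocabulary size
)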

If you can find the distilbert folder on your PC, you can see that the vocabulary is just a text file with one token per line. You can do whatever you want with it (see the sketch after the listing below).

# download the model files by running this line in a terminal (not the Python prompt):
# git clone https://huggingface.co/distilbert-base-uncased

import os
path = "C:/Users/...../distilbert-base-uncased"
print(os.listdir(path))

# ['.git',
# '.gitattributes',
# 'config.json',
# 'flax_model.msgpack',
# 'pytorch_model.bin',
# 'README.md', 'rust_model.ot',
# 'tf_model.h5',
# 'tokenizer.json',
# 'tokenizer_config.json',
# 'vocab.txt']
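
For example, a minimal sketch of reading that one-column file into a token-to-id mapping (the path is the assumed local clone from above):

# Sketch: read vocab.txt into a {token: id} dict; the line number is the token id.
vocab = {}
with open(path + "/vocab.txt", encoding="utf-8") as f:
    for idx, line in enumerate(f):
        vocab[line.rstrip("\n")] = idx

print(len(vocab))        # number of entries in the vocabulary
print(vocab["[CLS]"])    # id of the [CLS] token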
  • Yes, you can add words to this file, but what if we want to remove some tokens from this tokenizer? Can this be done? Will the model still operate without any problem? – mitra mirshafiee Jan 05 '22 at 12:32
  • The model understands words as tokens, as you know: 205 (token) = some word. If you change the 205th token to some other word, you need to re-train your model with your new vocabulary and new tokens – bitbang Jan 05 '22 at 20:42