
I have some text which I want to perform NLP on. To do so, I download a pre-trained tokenizer like so:

import transformers as ts

pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp')

Then I create my own tokenizer with my data like this:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(['transcripts.raw'], trainer)

Now comes the part where I get confused: I need to update the entries in the pretrained tokenizer (pr_tokenizer) whose keys are the same as in my tokenizer (tokenizer). I have tried several methods; here is one of them:

new_vocab = pr_tokenizer.vocab
v = tokenizer.get_vocab()

for i in v:
    if i in new_vocab:
        new_vocab[i] = v[i]

So what do I do now? I was thinking something like:

pr_tokenizer.vocab.update(new_vocab)

or

pr_tokenizer.vocab = new_vocab

Neither works. Does anyone know a good way of doing this?


2 Answers


To do that, you can download the tokenizer files from GitHub or the HuggingFace website into the same folder as your code, and then edit the vocabulary file before the tokenizer is loaded.
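One way to get the files is git clone https://huggingface.co/distilbert-base-uncased; a Python alternative (a sketch, assuming the huggingface_hub package is installed, not part of the original answer) is:

# Sketch: fetch the distilbert-base-uncased files into a local folder.
# Assumes a recent huggingface_hub (pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="distilbert-base-uncased", local_dir="./distilbert-base-uncased")

With the files on disk, edit the vocabulary and reload: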

new_vocab = {}

# Read the pretrained vocabulary: one token per line, the line number is the token id
with open('./distilbert-base-uncased/vocab.txt', 'r') as f:
    for i, row in enumerate(f):
        new_vocab[row[:-1]] = i

# your vocabulary entries
v = tokenizer.get_vocab()

# replace the ids of tokens that appear in both vocabularies (your code)
for i in v:
    if i in new_vocab:
        new_vocab[i] = v[i]

# overwrite vocab.txt so that from_pretrained below picks up the edited vocabulary
with open('./distilbert-base-uncased/vocab.txt', 'w') as f:
    # invert the mapping so tokens can be looked up by id
    rev_vocab = {idx: token for token, idx in new_vocab.items()}
    # write the tokens back, one per line, in id order (skipping any gaps)
    for i in range(len(rev_vocab)):
        if i not in rev_vocab:
            continue
        f.write(rev_vocab[i] + '\n')

# loading the new tokenizer
pr_tokenizer = ts.AutoTokenizer.from_pretrained('./distilbert-base-uncased')
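As a quick sanity check (a sketch; the example sentence is arbitrary and not from the original answer), you can confirm the reloaded tokenizer picks up the edited vocabulary:

# Sketch: verify that the reloaded tokenizer uses the edited vocab.txt.
print(pr_tokenizer.tokenize("an example sentence from the transcripts"))
print(pr_tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", "[UNK]"]))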
  • I don't get why you two need to do this; aren't you going to use the pretrained model? Why isn't AutoTokenizer.from_pretrained('./distilbert-base-uncased') enough? – bitbang Nov 02 '21 at 14:22
  • @bitbang I wanted to "fine-tune" the tokenizer, and this seems to be the only way that I have found on the internet. – user9102437 Nov 03 '21 at 10:03
  • I believe tokens are just integers that NLP models consume and convert to vectors with their own understanding – bitbang Nov 03 '21 at 14:43
  • @user9102437 FYI, instead of editing an existing tokenizer, you can also train a new one (see the sketch after these comments) – SilentCloud Nov 03 '21 at 15:58
  • @SilentCloud Sure, but it probably won't be as good as one trained on big data with powerful machines (if the data is similar, that is). – user9102437 Nov 03 '21 at 16:46
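
Following up on SilentCloud's comment, a minimal sketch of training a brand-new tokenizer that keeps the pretrained tokenizer's configuration (this assumes pr_tokenizer is a "fast" tokenizer, which AutoTokenizer returns by default, and reuses the transcripts file from the question):

# Sketch: train a new tokenizer with the same pipeline/settings as the pretrained one.
def line_iterator(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line

new_tokenizer = pr_tokenizer.train_new_from_iterator(
    line_iterator("transcripts.raw"),    # corpus from the question
    vocab_size=pr_tokenizer.vocab_size,  # keep the original vocabulary size
)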

If you can find the distilbert folder on your PC, you can see that the vocabulary is just a text file with one token per line. You can do whatever you want with it (see the sketch after the listing below).

# download the model files by running this line in a terminal (not the Python prompt):
# git clone https://huggingface.co/distilbert-base-uncased

import os
path = "C:/Users/...../distilbert-base-uncased"
print(os.listdir(path))

# ['.git',
# '.gitattributes',
# 'config.json',
# 'flax_model.msgpack',
# 'pytorch_model.bin',
# 'README.md', 'rust_model.ot',
# 'tf_model.h5',
# 'tokenizer.json',
# 'tokenizer_config.json',
# 'vocab.txt']
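
For example, a minimal sketch of reading that one-column file into a token-to-id mapping (the path is the assumed local clone from above):

# Sketch: read vocab.txt into a {token: id} dict; the line number is the token id.
vocab = {}
with open(path + "/vocab.txt", encoding="utf-8") as f:
    for idx, line in enumerate(f):
        vocab[line.rstrip("\n")] = idx

print(len(vocab))        # number of entries in the vocabulary
print(vocab["[CLS]"])    # id of the [CLS] token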
  • Yes, you can add words to this file, but what if we want to remove some tokens from this tokenizer? Can this be done? Will the model still operate without any problem? – mitra mirshafiee Jan 05 '22 at 12:32
  • The model understands words as tokens, as you know: 205 (token) = some word. If you change the 205th token to some other word, you need to re-train your model with your new vocabulary and new tokens – bitbang Jan 05 '22 at 20:42