Unable to find the word that I added to the Huggingface Bert tokenizer vocabulary

Question

I tried to add new words to the Bert tokenizer vocab. The length of the vocab increases, but I can't find the newly added word in the vocab.

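from transformers import BertTokenizer

# Assumed checkpoint; the question does not say which one was loaded
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_tokens(['covid', 'wuhan'])

v = tokenizer.get_vocab()

print(len(v))
print('covid' in tokenizer.vocab)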

Output:

30524

False

score 2 · Accepted Answer · answered Dec 24 '20 at 22:47


`tokenizer.vocab` and `tokenizer.get_vocab()` are two different things. The first is an attribute that holds only the base vocabulary, without the added tokens; the second is a method that returns the base vocabulary together with the added tokens.

from transformers import BertTokenizer

t = BertTokenizer.from_pretrained('bert-base-uncased')

print(len(t.vocab))
print(len(t.get_vocab()))
print(t.get_added_vocab())
t.add_tokens(['covid'])
print(len(t.vocab))
print(len(t.get_vocab()))
print(t.get_added_vocab())

Output:

30522
30522
{}
30522
30523
{'covid': 30522}
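To look up a single added token rather than dumping the whole vocabulary, the conversion methods can be used in both directions. The following is a minimal sketch, assuming a transformers 4.x release in which `BertTokenizer` is the slow (pure Python) tokenizer; the toy vocabulary and the temporary `vocab.txt` file are hypothetical and only there so the example runs without downloading a checkpoint.

```python
import os
import tempfile

from transformers import BertTokenizer

# Hypothetical toy vocabulary, written to disk so that no download is needed.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocab) + "\n")
    # The vocabulary is read into memory when the tokenizer is constructed.
    t = BertTokenizer(vocab_file=vocab_path)

t.add_tokens(["covid"])

# get_vocab() merges the base vocabulary with the added tokens.
full_vocab = t.get_vocab()

# The conversion methods also know about added tokens.
new_id = t.convert_tokens_to_ids("covid")
round_trip = t.convert_ids_to_tokens(new_id)

print("covid" in full_vocab)
print(new_id, round_trip)
```

With a real checkpoint, the same calls should work after replacing the constructor with `BertTokenizer.from_pretrained('bert-base-uncased')`.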


cronoik


Thank you for responding. I see that the size of the tokenizer has increased, but when I call tokenizer.ids_to_tokens[30522] after adding a token I get KeyError: 30522. What could I be doing wrong here? How do I see the added token? – Jagadish Vishwanatham Dec 25 '20 at 01:32
@JagadishVishwanatham You can view all of the added tokens with `t.get_added_vocab()`, as shown in my answer, or look up a single token with `t.convert_ids_to_tokens(30522)`. – cronoik Dec 25 '20 at 06:16
