
I am using a pre-trained BERT model to tokenize text into meaningful tokens. However, the text contains many domain-specific words that I don't want the BERT model to break into word pieces. Is there any solution for this? For example:

from transformers import BertTokenizer

tokenizer = BertTokenizer('bert-base-uncased-vocab.txt')
tokens = tokenizer.tokenize("metastasis")

This produces tokens like:

['meta', '##sta', '##sis']

However, I want to keep the whole word as one token, like this:

['metastasis']
parvaneh shayegh
  • Maybe `' '.join([x for x in tokens]).replace(' ##', '')` will do? – Wiktor Stribiżew May 29 '20 at 09:39
  • Thanks for your answer, but I can't do this because I still want word pieces for other (non-specific) words; for example, 'extracting' becomes ['extract', '##ing']. – parvaneh shayegh May 29 '20 at 10:06
  • You do not usually need this; subword tokenization is very useful for handling OOV words and helps decrease the vocabulary size. Why do you need to add exceptions? – Wiktor Stribiżew May 29 '20 at 10:14
  • Please correct me if I am wrong, but in my example the tokens for 'metastasis' are 'meta', 'sta' and 'sis'. However, I want to keep 'metastasis' as one whole word because it has no relation to 'meta'. – parvaneh shayegh May 29 '20 at 10:21
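To make the comment thread concrete, here is a small sketch of what the suggested ' '.join(...).replace(' ##', '') trick does (the loop and word list are just illustrative): it rejoins word pieces after tokenization, so it would also merge the splits the OP wants to keep for non-specific words.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
for word in ["metastasis", "extracting"]:
    pieces = tokenizer.tokenize(word)
    rejoined = ' '.join(pieces).replace(' ##', '')
    print(pieces, '->', rejoined)
# ['meta', '##sta', '##sis'] -> metastasis
# ['extract', '##ing'] -> extracting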

3 Answers


You are free to add new tokens to the existing pretrained tokenizer, but then you need to train (fine-tune) your model with the extended tokenizer so that it learns embeddings for the extra tokens.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
v = tokenizer.get_vocab()
print(len(v))  # vocabulary size before adding tokens

tokenizer.add_tokens(['whatever', 'underdog'])  # returns the number of tokens actually added

v = tokenizer.get_vocab()
print(len(v))  # vocabulary size after adding tokens

If a token already exists in the vocabulary, as 'whatever' does here, it will not be added again; that is why the vocabulary only grows by one.

Output:

30522
30523
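As a quick sketch tying this back to the word from the question (assuming the same pretrained tokenizer), an added word is no longer split into word pieces:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('metastasis'))   # ['meta', '##sta', '##sis']

tokenizer.add_tokens(['metastasis'])      # register the word as a single token
print(tokenizer.tokenize('metastasis'))   # ['metastasis']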
prosti

I think that if I use this solution, e.g.

tokenizer.add_tokens(['whatever', 'underdog'])

then the vocab_size changes. Does this mean I cannot use the pretrained model from transformers, because the embedding size no longer matches?
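For what it's worth, the usual way to handle this size mismatch (a sketch assuming a Hugging Face BertModel; adapt to your model class) is to resize the embedding matrix after adding tokens. The pretrained weights are kept, and only the rows for the added tokens are freshly initialized:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

tokenizer.add_tokens(['metastasis'])

# Grow the embedding matrix to the new vocabulary size; existing pretrained
# rows are preserved, the new row(s) are randomly initialized and should be
# learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))

The new embeddings start out untrained, so some fine-tuning on in-domain data is still needed before the added tokens carry useful representations.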

xuan zhou

Based on the discussion here, one way to use an additional vocabulary containing the specific words is to overwrite some of the first ~1000 lines of the vocab.txt file (the [unused] placeholder lines) with those words. For example, I replaced '[unused1]' with 'metastasis' in vocab.txt, and after tokenizing with the modified vocab.txt I got this output:

tokens = tokenizer.tokenize("metastasis")
Output: ['metastasis']
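A minimal sketch of this vocab-swap approach (the file path and word list are placeholders; work on a copy of the original vocab file):

from transformers import BertTokenizer

vocab_path = "bert-base-uncased-vocab.txt"      # local copy of the vocab file
domain_words = ["metastasis"]                   # words to keep as whole tokens

with open(vocab_path, encoding="utf-8") as f:
    vocab = f.read().splitlines()

# Overwrite [unusedN] placeholder lines with the domain-specific words so the
# vocabulary size (and hence the embedding matrix size) stays unchanged.
unused_idx = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
for i, word in zip(unused_idx, domain_words):
    vocab[i] = word

with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

tokenizer = BertTokenizer(vocab_path)
print(tokenizer.tokenize("metastasis"))         # expected: ['metastasis']

Because the vocabulary size does not change, the pretrained embedding matrix still fits; however, the [unused] slots were never trained, so the new words still need fine-tuning to pick up meaningful embeddings.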
parvaneh shayegh