
I have text containing custom tokens, like <adjective>, and I am trying to prepare a byte-level tokenizer that won't split them:

from tokenizers.pre_tokenizers import ByteLevel

tokenizer.pre_tokenizer = ByteLevel()
tokenizer.pre_tokenizer.pre_tokenize_str("<adjective>")

[('Ġ<', (0, 2)), ('adjective', (2, 11)), ('>', (11, 12))]

How do I add <adjective> not as a special token, but as a token that the tokenizer should not split?

artona

1 Answer


New tokens can be added to a tokenizer through the Hugging Face Transformers API as follows:

tokenizer.add_tokens('<adjective>')

This adds '<adjective>' as a single token that will not be split.

This also requires resizing the model's embedding matrix to account for the enlarged vocabulary:

model.resize_token_embeddings(len(tokenizer))
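Conceptually, this works because added tokens are matched against the raw text before the pre-tokenizer runs, so they are never split into bytes or subwords. Below is a minimal pure-Python sketch of that mechanism; the function name tokenize_with_added and the fallback split regex are illustrative assumptions, not the library's actual implementation:

```python
import re

def tokenize_with_added(text, added_tokens, split_pattern=r"\w+|[^\w\s]"):
    """Split `text` with `split_pattern`, but emit any string in
    `added_tokens` whole, mimicking how added tokens are matched
    before the pre-tokenizer gets to see the text."""
    # Try longer added tokens first so overlapping tokens match greedily.
    added = sorted(added_tokens, key=len, reverse=True)
    if not added:
        return re.findall(split_pattern, text)
    # Capture added tokens so re.split keeps them in the output.
    pattern = "(" + "|".join(re.escape(t) for t in added) + ")"
    pieces = []
    for part in re.split(pattern, text):
        if part in added_tokens:
            pieces.append(part)                       # protected: kept intact
        else:
            pieces.extend(re.findall(split_pattern, part))  # normal splitting
    return pieces

print(tokenize_with_added("a <adjective> day", {"<adjective>"}))
# ['a', '<adjective>', 'day']
print(tokenize_with_added("<adjective>", set()))
# ['<', 'adjective', '>']  -- without protection it gets split
```

The same idea is why add_tokens solves the original problem: the match happens upstream of ByteLevel, so the pre-tokenizer never sees '<adjective>' as ordinary text.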

Ashwin Geet D'Sa