
I have text containing custom tokens, like <adjective>, and I am trying to prepare a byte-level tokenizer that won't split them:

from tokenizers.pre_tokenizers import ByteLevel

tokenizer.pre_tokenizer = ByteLevel()
tokenizer.pre_tokenizer.pre_tokenize_str("<adjective>")

[('Ġ<', (0, 2)), ('adjective', (2, 11)), ('>', (11, 12))]

How do I add <adjective> not as a special token, but as a token that the tokenizer should not split?

artona

1 Answer


New tokens can be added to a tokenizer through the Hugging Face Transformers API as follows:

tokenizer.add_tokens('<adjective>')

This adds '<adjective>' as a single token that will not be split.

This also requires resizing the model's embedding matrix to account for the enlarged vocabulary:

model.resize_token_embeddings(len(tokenizer))
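Conceptually, this works because added tokens are matched against the raw text before the pre-tokenizer runs, so they are never split into bytes or subwords. Below is a minimal pure-Python sketch of that mechanism; the function name tokenize_with_added and the fallback split regex are illustrative assumptions, not the library's actual implementation:

```python
import re

def tokenize_with_added(text, added_tokens, split_pattern=r"\w+|[^\w\s]"):
    """Split `text` with `split_pattern`, but emit any string in
    `added_tokens` whole, mimicking how added tokens are matched
    before the pre-tokenizer gets to see the text."""
    # Try longer added tokens first so overlapping tokens match greedily.
    added = sorted(added_tokens, key=len, reverse=True)
    if not added:
        return re.findall(split_pattern, text)
    # Capture added tokens so re.split keeps them in the output.
    pattern = "(" + "|".join(re.escape(t) for t in added) + ")"
    pieces = []
    for part in re.split(pattern, text):
        if part in added_tokens:
            pieces.append(part)                       # protected: kept intact
        else:
            pieces.extend(re.findall(split_pattern, part))  # normal splitting
    return pieces

print(tokenize_with_added("a <adjective> day", {"<adjective>"}))
# ['a', '<adjective>', 'day']
print(tokenize_with_added("<adjective>", set()))
# ['<', 'adjective', '>']  -- without protection it gets split
```

The same idea is why add_tokens solves the original problem: the match happens upstream of ByteLevel, so the pre-tokenizer never sees '<adjective>' as ordinary text.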

Ashwin Geet D'Sa