I expect decoded to be equal to text, but the decoded string has a space inserted between every character:

import tokenizers

text = "Hello World!"
tokenizer = tokenizers.Tokenizer(tokenizers.models.Unigram())
tokenizer.train_from_iterator(text)
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)
print(decoded)
# 'H e l l o   W o r l d !'

How can I change the tokenizer so that decoding returns the original text?
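
For what it's worth, the output looks like what you would get if every character became its own token and the pieces were joined with a single space on decode. I have not confirmed that this is what the library does internally. The sketch below is plain Python only (no tokenizers calls), and the `tokens` list is my assumption about the encoding, not real tokenizer output.

```python
text = "Hello World!"

# Assumption: the trained vocabulary ends up with one token per character,
# so the encoding would correspond to a list like this (illustrative only).
tokens = list(text)

# Joining the pieces with a single space reproduces the output shown above.
spaced = " ".join(tokens)
print(repr(spaced))

# Joining with nothing in between gives back the original string.
fused = "".join(tokens)
print(repr(fused))
```

If that guess is right, the space character itself is a token, which would explain the three spaces between "o" and "W".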

Yorai Levi
