
I am following the tutorial *How to train a new language model from scratch using Transformers and Tokenizers*.

In Section 2 (Train a tokenizer), after training on my own Vietnamese text data, I looked at the generated .vocab file, and all the tokens look like this:

"ĠÄij":268,"nh":269,"á»§":270,"Ãł":271,"Ġch":272,"iá»":273,"á":274,"Ġl":275,"Ġb":276,"ư":277,"Ġh":278,"ế":279,

Any idea how to fix this?

Looks like a kind of [mojibake](https://en.wikipedia.org/wiki/Mojibake). Please edit your question to provide a minimal reproducible example. – JosefZ Feb 04 '21 at 10:12

1 Answer


You can decode these tokens with a ByteLevel decoder (the decoder component used by ByteLevelBPETokenizer) as follows:

from tokenizers.decoders import ByteLevel

# `tokenizer` is the ByteLevelBPETokenizer you trained on your data
a = tokenizer.encode("thầy giáo rất tốt.").tokens
print(a)
>> ['<s>', 'thầy', 'Ġgiáo', 'Ġrất', 'Ġtá»ijt', '.', '</s>']

# Decode the byte-level tokens back into readable text
decoder = ByteLevel()
print(decoder.decode(a))
>> <s>thầy giáo rất tốt.</s>
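
Those tokens are not actually corrupted. A byte-level BPE tokenizer stores every raw byte as a printable unicode stand-in (the GPT-2 style byte-to-unicode mapping), so each byte of a multi-byte UTF-8 character gets its own symbol in the vocab file, and no information is lost. A minimal sketch of the idea (not the library's exact code):

# The space byte 0x20 is shifted into a printable range and becomes 'Ġ',
# which is why tokens that start a new word carry a leading 'Ġ'.
print(chr(0x20 + 0x100))
>> Ġ

# "tốt" is three characters but five UTF-8 bytes; each byte gets its own
# stand-in character, which is how it ends up rendered as "tá»ijt".
print("tốt".encode("utf-8"))
>> b't\xe1\xbb\x91t'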