
I am following the tutorial *How to train a new language model from scratch using Transformers and Tokenizers*.

In Section 2 (Train a tokenizer), after training on my own Vietnamese text data, I looked at the generated .vocab file, and all the tokens look like this:

"ĠÄij":268,"nh":269,"á»§":270,"Ãł":271,"Ġch":272,"iá»":273,"á":274,"Ġl":275,"Ġb":276,"ư":277,"Ġh":278,"ế":279,

Any idea how to fix this?

Looks like a kind of [mojibake](https://en.wikipedia.org/wiki/Mojibake). Please edit your question to provide a minimal reproducible example. – JosefZ Feb 04 '21 at 10:12

1 Answer


You can decode these tokens with a ByteLevel decoder (the decoder component used by ByteLevelBPETokenizer) as follows:

from tokenizers.decoders import ByteLevel

# `tokenizer` is the ByteLevelBPETokenizer you trained on your data
a = tokenizer.encode("thầy giáo rất tốt.").tokens
print(a)
>> ['<s>', 'thầy', 'Ġgiáo', 'Ġrất', 'Ġtá»ijt', '.', '</s>']

# Decode the byte-level tokens back into readable text
decoder = ByteLevel()
print(decoder.decode(a))
>> <s>thầy giáo rất tốt.</s>
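
Those tokens are not actually corrupted. A byte-level BPE tokenizer stores every raw byte as a printable unicode stand-in (the GPT-2 style byte-to-unicode mapping), so each byte of a multi-byte UTF-8 character gets its own symbol in the vocab file, and no information is lost. A minimal sketch of the idea (not the library's exact code):

# The space byte 0x20 is shifted into a printable range and becomes 'Ġ',
# which is why tokens that start a new word carry a leading 'Ġ'.
print(chr(0x20 + 0x100))
>> Ġ

# "tốt" is three characters but five UTF-8 bytes; each byte gets its own
# stand-in character, which is how it ends up rendered as "tá»ijt".
print("tốt".encode("utf-8"))
>> b't\xe1\xbb\x91t'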