I'm following this tutorial: How to train a new language model from scratch using Transformers and Tokenizers.
In Section 2 (Train a tokenizer), after training on my own Vietnamese text data, I looked at the generated .vocab file and all the tokens look like this:
"ĠÄij":268,"nh":269,"á»§":270,"Ãł":271,"Ġch":272,"iá»":273,"á":274,"Ġl":275,"Ġb":276,"ư":277,"Ġh":278,"ế":279,
Any idea how to fix this?