I use tokenizers to split natural language sentences into tokens, but I came up with some questions.
Here are some examples I tried with tokenizers:
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer("是")
# {'input_ids': [42468], 'attention_mask': [1]}
tokenizer("我说你倒是快点啊")
# {'input_ids': [22755, 239, 46237, 112, 19526, 254, 161, 222, 240, 42468, 33232, 104, 163, 224, 117, 161, 243, 232], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenizer("東")
# {'input_ids': [30266, 109], 'attention_mask': [1, 1]}
tokenizer("東京")
# {'input_ids': [30266, 109, 12859, 105], 'attention_mask': [1, 1, 1, 1]}
tokenizer("東京メトロ")
# {'input_ids': [30266, 109, 12859, 105, 26998, 13298, 16253], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokenizer("メトロ")
# {'input_ids': [26998, 13298, 16253], 'attention_mask': [1, 1, 1]}
tokenizer("This is my fault")
# {'input_ids': [1212, 318, 616, 8046], 'attention_mask': [1, 1, 1, 1]}
The last example is an English sentence, and I can understand it: "This" corresponds to "This": 1212 in vocab.json, and "is" corresponds to "\u0120is": 318.
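If I'm reading the API right, the same lookup can also be done in Python (I believe get_vocab() returns the token-to-id mapping, i.e. the contents of vocab.json):

vocab = tokenizer.get_vocab()   # token string -> id, same content as vocab.json
print(vocab["This"])            # 1212
print(vocab["\u0120is"])        # 318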
But I cannot understand why this tool tokenizes non-English text into tokens I cannot find in the vocab.
For example, 東 is tokenized into 30266 and 109, yet the corresponding entries in vocab.json are "æĿ": 30266 and "±": 109.
Similarly, メ is tokenized into 26998, whose entry in vocab.json is "ãĥ¡": 26998.
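I also tried mapping the ids back to token strings (assuming convert_ids_to_tokens does what I think it does), and it returns the same strange strings:

print(tokenizer.convert_ids_to_tokens([30266, 109]))  # ['æĿ', '±']  (from tokenizer("東"))
print(tokenizer.convert_ids_to_tokens([26998]))       # ['ãĥ¡']      (from tokenizer("メ"))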
I searched the Hugging Face documentation and website but found no clue, and the source code is written in Rust, which is hard for me to understand. Could you help me figure out why?