I use tokenizers to split natural language sentences into tokens, but I came up with some questions.
Here are some examples I tried with tokenizers:
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer("是")
# {'input_ids': [42468], 'attention_mask': [1]}
tokenizer("我说你倒是快点啊")
# {'input_ids': [22755, 239, 46237, 112, 19526, 254, 161, 222, 240, 42468, 33232, 104, 163, 224, 117, 161, 243, 232], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenizer("東")
# {'input_ids': [30266, 109], 'attention_mask': [1, 1]}
tokenizer("東京")
# {'input_ids': [30266, 109, 12859, 105], 'attention_mask': [1, 1, 1, 1]}
tokenizer("東京メトロ")
# {'input_ids': [30266, 109, 12859, 105, 26998, 13298, 16253], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokenizer("メトロ")
# {'input_ids': [26998, 13298, 16253], 'attention_mask': [1, 1, 1]}
tokenizer("This is my fault")
# {'input_ids': [1212, 318, 616, 8046], 'attention_mask': [1, 1, 1, 1]}
The last example is an English sentence, and I can understand it: "This" corresponds to "This": 1212 in vocab.json, and "is" corresponds to "\u0120is": 318.
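If I'm reading the API right, the same lookup can also be done in Python (I believe get_vocab() returns the token-to-id mapping, i.e. the contents of vocab.json):

vocab = tokenizer.get_vocab()   # token string -> id, same content as vocab.json
print(vocab["This"])            # 1212
print(vocab["\u0120is"])        # 318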
But I cannot understand why this tool tokenizes non-English text into tokens I cannot find in the vocab.
For example, 東 is tokenized into 30266 and 109, yet the corresponding entries in vocab.json are "æĿ": 30266 and "±": 109.
Similarly, メ is tokenized into 26998, whose entry in vocab.json is "ãĥ¡": 26998.
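I also tried mapping the ids back to token strings (assuming convert_ids_to_tokens does what I think it does), and it returns the same strange strings:

print(tokenizer.convert_ids_to_tokens([30266, 109]))  # ['æĿ', '±']  (from tokenizer("東"))
print(tokenizer.convert_ids_to_tokens([26998]))       # ['ãĥ¡']      (from tokenizer("メ"))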
I searched the Hugging Face documentation and website but found no clue, and the source code is written in Rust, which is hard for me to understand. Could you help me figure out why?