We all know that GPT-3 models can accept and produce many different languages, such as English, French, Chinese, Japanese, and so on.
In traditional NLP, different languages use different tokenization methods.
- For alphabetic languages such as English, models like BERT use a subword method such as BPE (BERT itself uses the closely related WordPiece) to split words into tokens, like below; a short code sketch of both approaches follows these examples:
Insomnia caused much frustration.
==>
In-, som-, nia, caus-, ed, much, frus-, tra-, tion, .,
- For character-based languages such as Chinese or Japanese, each character is simply used as a token, like below:
東京メトロは心に寄り添う (roughly, "Tokyo Metro stays close to your heart")
==>
東, 京, メ, ト, ロ, は, 心, に, 寄, り, 添, う,
我说你倒是快点啊!!! (roughly, "I said, hurry up already!!!")
==>
我, 说, 你, 倒, 是, 快, 点, 啊, !, !, !,
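
Here is a minimal sketch of the two traditional approaches above, assuming the Hugging Face `transformers` package is available. The exact subword pieces BERT produces depend on its WordPiece vocabulary, so they will not match the hand-written split exactly:

```python
from transformers import BertTokenizer

# Subword tokenization for an alphabetic language (English).
# The actual pieces are decided by the pretrained vocabulary.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("Insomnia caused much frustration."))

# Character-level tokenization for Chinese/Japanese:
# simply treat each character as its own token.
print(list("東京メトロは心に寄り添う"))
# ['東', '京', 'メ', 'ト', 'ロ', 'は', '心', 'に', '寄', 'り', '添', 'う']
```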
But GPT-3 is trained on many different languages and can produce both Chinese and English in a single sentence, so I am really curious how this model makes its tokens.
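
For reference, one way to inspect this is with the open GPT-2 tokenizer, since the GPT-3 paper says it reuses GPT-2's byte-level BPE. This is only a sketch under that assumption: the text is first encoded as UTF-8 bytes, so a single Chinese character often maps to more than one token rather than to one "character" token:

```python
from transformers import GPT2Tokenizer

gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

# A mixed Chinese/English sentence (hypothetical example input).
mixed = "I love 東京 and 北京!"
ids = gpt2_tok.encode(mixed)
print(ids)

# Decode each id separately to see how the mixed sentence was split.
# Tokens holding a partial UTF-8 sequence may print as the replacement
# character "�", which shows that one character spans several tokens.
print([gpt2_tok.decode([i]) for i in ids])
```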