
GPT-3 models can accept and produce text in many languages, such as English, French, Chinese, and Japanese.

In traditional NLP, different languages use different tokenization methods.

  • For alphabetic languages such as English, BERT-style models use subword methods such as BPE (BERT itself uses WordPiece, a close variant) to make tokens like below:
Insomnia caused much frustration.
==>
In-, som-, nia, caus-, ed, much, frus-, tra-, tion, .,
  • For character-based languages such as Chinese or Japanese, the character itself is used as the token, like below (a short sketch follows these examples):
東京メトロは心に寄り添う
==>
東, 京, メ, ト, ロ, は, 心, に, 寄, り, 添, う,
我说你倒是快点啊!!!
==>
我, 说, 你, 倒, 是, 快, 点, 啊, !, !, !, 
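A minimal sketch of both traditional behaviors, assuming the Hugging Face `transformers` library and its public `bert-base-uncased` / `bert-base-chinese` checkpoints:

```python
# Sketch of the two traditional approaches; assumes the Hugging Face
# "transformers" library and its public BERT checkpoints.
from transformers import AutoTokenizer

# Subword (WordPiece) tokenization for an alphabetic language.
en_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(en_tok.tokenize("Insomnia caused much frustration."))
# Rare words are split into pieces, with '##' marking continuations,
# e.g. something like ['ins', '##om', '##nia', 'caused', ...].

# Character-level tokenization for Chinese.
zh_tok = AutoTokenizer.from_pretrained("bert-base-chinese")
print(zh_tok.tokenize("我说你倒是快点啊"))
# One token per character: ['我', '说', '你', '倒', '是', '快', '点', '啊']
```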

But GPT-3 handles many languages at once and can produce both Chinese and English in a single sentence, so I am really curious how this model makes tokens.


1 Answer


Use the [Tokenizer](https://platform.openai.com/tokenizer) to understand how a piece of text would be tokenized by the OpenAI API.

For example, `Insomnia caused much frustration.` would be tokenized as 6 tokens.

[Screenshot: tokenizer visualization for Example 1]

Whereas `我说你倒是快点啊!!!` would be tokenized as 27 tokens, with a note at the bottom:

> Note: Your input contained one or more unicode characters that map to multiple tokens. The output visualization may display the bytes in each token in a non-standard way.

[Screenshot: tokenizer visualization for Example 2]
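If you want to reproduce these counts in code rather than through the web page, OpenAI's open-source `tiktoken` library exposes the same encodings. Below is a sketch assuming the GPT-3-era `r50k_base` encoding (exact counts vary between encodings), including a mixed Chinese-and-English string of the kind raised in the comments:

```python
# Sketch: counting and inspecting tokens with OpenAI's tiktoken library.
# "r50k_base" is the GPT-3-era encoding; newer models use different
# encodings, so counts may differ from the web Tokenizer.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for text in ["Insomnia caused much frustration.",
             "我说你倒是快点啊!!!",
             "GPT can mix 中文 and English in one sentence."]:
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} tokens")
    # Each token is a byte sequence. A multi-byte UTF-8 character
    # (e.g. a CJK character) can be split across several tokens, which
    # is exactly what the note under the visualization warns about.
    print([enc.decode_single_token_bytes(t) for t in ids])
```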

  • But I am still confused about how Chinese and other Unicode characters, such as emojis and Japanese or Korean characters, are tokenized, because the code is written in Rust in [tokenizers](https://github.com/huggingface/tokenizers) (a byte-level sketch follows this thread). – dongrixinyu Feb 16 '23 at 03:18
  • Hm, I searched a bit, but found nothing special other than [this](https://github.com/openai/CLIP/issues/7). – Rok Benko Feb 16 '23 at 09:15
  • Thank you very much all the same. I tried ChatGPT and it can smoothly handle both Chinese and English in a single reply, so I am really curious about the tokenization method, which I assume must be different from the original Chinese character tokenization. – dongrixinyu Feb 16 '23 at 10:18
  • I found Facebook provides an open-source repo, [MUSE](https://github.com/facebookresearch/MUSE), which might be thought-provoking. – dongrixinyu Feb 17 '23 at 07:47
  • How could it help you to figure out OpenAI API tokenization? – Rok Benko Feb 17 '23 at 09:01
  • It doesn't help me. It uses a different approach from BPE token generation. – dongrixinyu Feb 20 '23 at 07:23
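Following up on the emoji/CJK question in the thread above: byte-level BPE sidesteps per-language tokenizers entirely, because the base vocabulary contains all 256 byte values, so any Unicode text, emoji included, is first serialized to UTF-8 bytes and then merged. A sketch, again assuming `tiktoken` with `r50k_base`:

```python
# Sketch: byte-level BPE on an emoji, assuming tiktoken with r50k_base.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
ids = enc.encode("🤖")
# The emoji is 4 UTF-8 bytes (F0 9F A4 96). If the vocabulary has no
# single merged token covering it, the bytes are spread across several
# tokens, each holding only part of the character.
print(ids, [enc.decode_single_token_bytes(t) for t in ids])
```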