
GPT-3 models can accept and produce text in many languages, such as English, French, Chinese, and Japanese.

In traditional NLP, different languages use different tokenization methods.

  • For alphabetic languages such as English, BERT-style models use subword methods such as BPE (BERT itself uses WordPiece, a close variant) to make tokens like below:
Insomnia caused much frustration.
==>
In-, som-, nia, caus-, ed, much, frus-, tra-, tion, .,
  • For character-based languages such as Chinese or Japanese, the character itself is used as the token, like below (a short sketch follows these examples):
東京メトロは心に寄り添う
==>
東, 京, メ, ト, ロ, は, 心, に, 寄, り, 添, う,
我说你倒是快点啊!!!
==>
我, 说, 你, 倒, 是, 快, 点, 啊, !, !, !, 
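A minimal sketch of both traditional behaviors, assuming the Hugging Face `transformers` library and its public `bert-base-uncased` / `bert-base-chinese` checkpoints:

```python
# Sketch of the two traditional approaches; assumes the Hugging Face
# "transformers" library and its public BERT checkpoints.
from transformers import AutoTokenizer

# Subword (WordPiece) tokenization for an alphabetic language.
en_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(en_tok.tokenize("Insomnia caused much frustration."))
# Rare words are split into pieces, with '##' marking continuations,
# e.g. something like ['ins', '##om', '##nia', 'caused', ...].

# Character-level tokenization for Chinese.
zh_tok = AutoTokenizer.from_pretrained("bert-base-chinese")
print(zh_tok.tokenize("我说你倒是快点啊"))
# One token per character: ['我', '说', '你', '倒', '是', '快', '点', '啊']
```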

But GPT-3 handles many languages at once and can produce both Chinese and English in a single sentence, so I am really curious how this model makes tokens.


1 Answer


Use the [Tokenizer](https://platform.openai.com/tokenizer) to understand how a piece of text would be tokenized by the OpenAI API.

For example, `Insomnia caused much frustration.` would be tokenized as 6 tokens.

[Screenshot: tokenizer visualization for Example 1]

Whereas `我说你倒是快点啊!!!` would be tokenized as 27 tokens, with a note at the bottom:

> Note: Your input contained one or more unicode characters that map to multiple tokens. The output visualization may display the bytes in each token in a non-standard way.

[Screenshot: tokenizer visualization for Example 2]
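If you want to reproduce these counts in code rather than through the web page, OpenAI's open-source `tiktoken` library exposes the same encodings. Below is a sketch assuming the GPT-3-era `r50k_base` encoding (exact counts vary between encodings), including a mixed Chinese-and-English string of the kind raised in the comments:

```python
# Sketch: counting and inspecting tokens with OpenAI's tiktoken library.
# "r50k_base" is the GPT-3-era encoding; newer models use different
# encodings, so counts may differ from the web Tokenizer.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for text in ["Insomnia caused much frustration.",
             "我说你倒是快点啊!!!",
             "GPT can mix 中文 and English in one sentence."]:
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} tokens")
    # Each token is a byte sequence. A multi-byte UTF-8 character
    # (e.g. a CJK character) can be split across several tokens, which
    # is exactly what the note under the visualization warns about.
    print([enc.decode_single_token_bytes(t) for t in ids])
```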

  • But I am still confused about how Chinese and other Unicode characters, such as emojis and Japanese or Korean characters, are tokenized, because the code is written in Rust in [tokenizers](https://github.com/huggingface/tokenizers) (a byte-level sketch follows this thread). – dongrixinyu Feb 16 '23 at 03:18
  • Hm, I searched a bit, but found nothing special other than [this](https://github.com/openai/CLIP/issues/7). – Rok Benko Feb 16 '23 at 09:15
  • Thank you very much all the same. I tried ChatGPT and it can smoothly handle both Chinese and English in a single reply, so I am really curious about the tokenization method, which I assume must be different from the original Chinese character tokenization. – dongrixinyu Feb 16 '23 at 10:18
  • I found Facebook provides an open-source repo, [MUSE](https://github.com/facebookresearch/MUSE), which might be thought-provoking. – dongrixinyu Feb 17 '23 at 07:47
  • How could it help you to figure out OpenAI API tokenization? – Rok Benko Feb 17 '23 at 09:01
  • It doesn't help me. It uses a different approach from BPE token generation. – dongrixinyu Feb 20 '23 at 07:23
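Following up on the emoji/CJK question in the thread above: byte-level BPE sidesteps per-language tokenizers entirely, because the base vocabulary contains all 256 byte values, so any Unicode text, emoji included, is first serialized to UTF-8 bytes and then merged. A sketch, again assuming `tiktoken` with `r50k_base`:

```python
# Sketch: byte-level BPE on an emoji, assuming tiktoken with r50k_base.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
ids = enc.encode("🤖")
# The emoji is 4 UTF-8 bytes (F0 9F A4 96). If the vocabulary has no
# single merged token covering it, the bytes are spread across several
# tokens, each holding only part of the character.
print(ids, [enc.decode_single_token_bytes(t) for t in ids])
```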