
I was looking at the vocabulary of GPT-2.

https://huggingface.co/gpt2/blob/main/vocab.json

I found to my surprise very weird tokens that I did not expect. For example, it contains the token (index 35496): ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ

How did this happen? Is this token common in the GPT-2 training data? In general, how was the vocabulary for GPT-2 built, and is there a problem here?
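
For reference, here is a quick way to reproduce the lookup (a sketch that assumes the Hugging Face transformers library is installed):

    # Sketch: look up a vocabulary entry by id with the Hugging Face tokenizer.
    # Requires `pip install transformers` and downloading the "gpt2" files.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")

    # The token string exactly as it is stored in vocab.json (byte-level BPE symbols).
    print(tok.convert_ids_to_tokens(35496))

    # Vocabulary size, for comparison.
    print(len(tok))  # 50257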

Daniel

1 Answer


Information about the model is available here: https://huggingface.co/gpt2

The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weighs 40GB of text but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here.

According to the Hugging Face GPT2Tokenizer documentation, the tokenizer is based on byte-level BPE (Byte-Pair Encoding), so such a token could have ended up there due to an encoding issue.
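
As a rough illustration of how byte-level BPE represents non-ASCII text (a sketch, again assuming the transformers library; the example word is arbitrary):

    # Sketch: GPT-2's byte-level BPE first turns text into UTF-8 bytes and maps
    # each byte to a printable unicode character, so raw bytes are what end up
    # as the strings you see in vocab.json.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")

    # "é" becomes the UTF-8 bytes 0xC3 0xA9, which are displayed as "Ã©"
    # through the byte-to-unicode mapping.
    print(tok.tokenize("café"))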

You can see that the character codes for Ã and Â are 195 and 194 (0xC3 and 0xC2). Could that be a two-byte character from a different encoding, or part of binary data that leaked into the corpus?
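
One plausible source of such runs of Ã (U+00C3, 195) and Â (U+00C2, 194) is mojibake: UTF-8 text that was repeatedly decoded with a single-byte encoding such as Latin-1 and re-encoded. A minimal sketch of the effect (an illustration only, not a claim about how this exact token was produced):

    # Sketch: mis-decoding UTF-8 as Latin-1 over and over turns one accented
    # character into a growing run that displays as "ÃÂÃÂ..." (the characters
    # in between are invisible C1 control codes).
    s = "é"
    for _ in range(5):
        s = s.encode("utf-8").decode("latin-1")
    print(s)       # displays as a run of ÃÂ pairs ending in ©
    print(len(s))  # 32: the length doubles on every round

Text corrupted this way is common on scraped web pages, so it is plausible that BPE learned a long ÃÂ run as a single token.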

If that token was not frequent in the training data, it is likely that it will never show up in the output. But it is an issue in the sense that the model wastes resources describing the behavior of that token.
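
To put "wastes resources" in rough numbers (a back-of-the-envelope sketch; the count of 500 junk tokens is an assumption for illustration):

    # Sketch: cost of junk vocabulary entries in GPT-2 small.
    # GPT-2 small: vocab size 50257, hidden size 768, ~124M parameters in total,
    # with the input embedding tied to the output projection.
    hidden_size = 768
    total_params = 124_000_000   # approximate
    junk_tokens = 500            # assumed number of "weird" entries, for illustration

    wasted = junk_tokens * hidden_size     # one 768-dim embedding row per junk token
    print(wasted)                          # 384000
    print(f"{wasted / total_params:.2%}")  # ~0.31% of all parameters

So even a few hundred junk entries would account for well under 1% of the parameters.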

Bob
  • This is not the only weird token, so this would mean that a LOT of resources are wasted on many non-meaningful tokens – Daniel Jan 24 '23 at 14:47
  • What fraction, 5%? We can guess that the input layer will have the same relative increase in the number of parameters. Machine learning is pragmatic: you train on the dataset you have and look at the scores on the target problems. If the result is better than the previous state of the art, it is worth publishing and maybe using. For many use cases the cost of retraining is not worth the benefit, unless you are in a resource-constrained environment. – Bob Jan 24 '23 at 17:23
  • It seems more like 1%, so I guess you are right that this wouldn't make much of a difference. Only perhaps in the training process, if you are training on a lot of unclean data. – Daniel Jan 25 '23 at 21:22
  • Also, if 1% of the tokens are weird, they are probably infrequent and thus should be much less than 1% of the training data. Overall it should not be noticeable in the training process either. – Bob Jan 26 '23 at 09:39