I am trying to use GPT2 for an Arabic text classification task as follows:

    from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

    # lab2ind maps each class label to an integer index
    tokenizer = GPT2Tokenizer.from_pretrained(model_path)
    model = GPT2ForSequenceClassification.from_pretrained(model_path,
                                                          num_labels=len(lab2ind))

However, when I use the tokenizer, it converts the Arabic characters into symbols like this: 'ĠÙĥتÙĬر'.
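
A minimal snippet reproducing this, assuming `model_path` is the aubmindlab/aragpt2-base checkpoint mentioned in the comments (the sample string is an arbitrary Arabic phrase):

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("aubmindlab/aragpt2-base")

    # The tokens print as byte-level symbols (shapes like 'ĠÙĥتÙĬر')
    # rather than readable Arabic script:
    print(tokenizer.tokenize("اللغة العربية"))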

  • What is `model_path`? – cronoik Jan 26 '21 at 15:05
  • @cronoik [aubmindlab/aragpt2-base](https://huggingface.co/aubmindlab/aragpt2-base) – Seeker Jan 26 '21 at 18:03
  • Just a guess: GPT uses a [BPE](https://huggingface.co/transformers/tokenizer_summary.html#byte-pair-encoding-bpe) tokenizer, and `'ĠÙĥتÙĬر'` is the byte representation of one token. For example, `t.tokenize('اَللُّغَةُ اَلْعَرَبِيَّة')` produces 37 tokens and is properly converted back with `t.decode(t.encode('اَللُّغَةُ اَلْعَرَبِيَّة'))`. Also, when I check their vocab, it doesn't seem that Arabic has tokens like Latin-script languages do, where you can still identify the word. Maybe that is caused by Unicode, or is it something language-specific I am unaware of? – cronoik Jan 27 '21 at 14:20
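
A short sketch of the roundtrip described in the comment above (a sketch only; `convert_tokens_to_string` is the tokenizer's method for mapping byte-level BPE tokens back to text):

    from transformers import GPT2Tokenizer

    t = GPT2Tokenizer.from_pretrained("aubmindlab/aragpt2-base")

    text = 'اَللُّغَةُ اَلْعَرَبِيَّة'
    tokens = t.tokenize(text)
    print(len(tokens))   # 37 byte-level tokens, per the comment above

    # The byte-level symbols map losslessly back to the original UTF-8
    # bytes, so the Arabic string is recovered intact:
    print(t.convert_tokens_to_string(tokens) == text)   # True
    print(t.decode(t.encode(text)) == text)             # True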

0 Answers