I am trying to use GPT2 for an Arabic text classification task as follows:

    from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

    # lab2ind maps each class label to an integer index
    tokenizer = GPT2Tokenizer.from_pretrained(model_path)
    model = GPT2ForSequenceClassification.from_pretrained(model_path,
                                                          num_labels=len(lab2ind))

However, when I use the tokenizer, it converts the Arabic characters into symbols like this: 'ĠÙĥتÙĬر'.
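
A minimal snippet reproducing this, assuming `model_path` is the aubmindlab/aragpt2-base checkpoint mentioned in the comments (the sample string is an arbitrary Arabic phrase):

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("aubmindlab/aragpt2-base")

    # The tokens print as byte-level symbols (shapes like 'ĠÙĥتÙĬر')
    # rather than readable Arabic script:
    print(tokenizer.tokenize("اللغة العربية"))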

  • What is `model_path`? – cronoik Jan 26 '21 at 15:05
  • @cronoik [aubmindlab/aragpt2-base](https://huggingface.co/aubmindlab/aragpt2-base) – Seeker Jan 26 '21 at 18:03
  • Just a guess: GPT uses a [BPE](https://huggingface.co/transformers/tokenizer_summary.html#byte-pair-encoding-bpe) tokenizer, and `'ĠÙĥتÙĬر'` is the byte representation of one token. For example, `t.tokenize('اَللُّغَةُ اَلْعَرَبِيَّة')` produces 37 tokens and is properly converted back with `t.decode(t.encode('اَللُّغَةُ اَلْعَرَبِيَّة'))`. Also, when I check their vocab, it doesn't seem that Arabic has tokens like Latin-script languages do, where you can still identify the word. Maybe that is caused by Unicode, or is it something language-specific I am unaware of? – cronoik Jan 27 '21 at 14:20
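
A short sketch of the roundtrip described in the comment above (a sketch only; `convert_tokens_to_string` is the tokenizer's method for mapping byte-level BPE tokens back to text):

    from transformers import GPT2Tokenizer

    t = GPT2Tokenizer.from_pretrained("aubmindlab/aragpt2-base")

    text = 'اَللُّغَةُ اَلْعَرَبِيَّة'
    tokens = t.tokenize(text)
    print(len(tokens))   # 37 byte-level tokens, per the comment above

    # The byte-level symbols map losslessly back to the original UTF-8
    # bytes, so the Arabic string is recovered intact:
    print(t.convert_tokens_to_string(tokens) == text)   # True
    print(t.decode(t.encode(text)) == text)             # True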

0 Answers