
The idea of using BertTokenizer from huggingface really confuses me.

  1. When I use

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.encode_plus("Hello")
    

Is the result somewhat similar to what I would get by passing a one-hot vector representing "Hello" to a learnable embedding matrix?

  2. How is

    BertTokenizer.from_pretrained("bert-base-uncased")
    

different from

    BertTokenizer.from_pretrained("bert-large-uncased")

and other pretrained models?

desertnaut

1 Answer


The encode_plus and encode functions tokenize your text and prepare it in the proper input format for the BERT model, so you can think of them as playing the same role as the one-hot step in your example: they map the text to integer token ids, which the model's embedding layer then looks up.
encode_plus returns a BatchEncoding consisting of input_ids, token_type_ids, and attention_mask.
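
For illustration, here is a minimal sketch of what the call from your question returns (the exact token ids come from the checkpoint's vocabulary, so treat the numbers in the comments as indicative):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encoded = tokenizer.encode_plus("Hello")

    # BatchEncoding behaves like a dict with three keys
    print(encoded["input_ids"])       # token ids, e.g. [101, 7592, 102] for [CLS] hello [SEP]
    print(encoded["token_type_ids"])  # segment ids, all 0 for a single sentence
    print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding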

The pre-trained models differ in the number of encoder layers: the base model has 12 encoder layers, while the large model has 24.
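
If it helps, you can check this directly from the model configurations (a small sketch, assuming the transformers library can download the configs):

    from transformers import BertConfig

    base = BertConfig.from_pretrained("bert-base-uncased")
    large = BertConfig.from_pretrained("bert-large-uncased")

    print(base.num_hidden_layers, base.hidden_size)    # 12 layers, hidden size 768
    print(large.num_hidden_layers, large.hidden_size)  # 24 layers, hidden size 1024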

Parsa Abbasi