
I'm a beginner in natural language processing. Recently, I tried to train a text generation model based on GPT-2 with Hugging Face transformers. I added some new tokens to the tokenizer and resized the model's embeddings with model.resize_token_embeddings(len(tokenizer)). Suppose I added 6 new tokens: should I add the weights of those 6 tokens to the optimizer? How should I do it? Thank you very much!

butyuhao

1 Answer


Just call resize_token_embeddings after adding the new tokens:

from transformers import AutoTokenizer, AutoModelForCausalLM

gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')
gpt2_model = AutoModelForCausalLM.from_pretrained('gpt2')

ATTR_TO_SPECIAL_TOKEN = {'additional_special_tokens': ['SPEC1', 'SPEC2']}

orig_num_tokens = len(gpt2_tokenizer)
num_added_tokens = gpt2_tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN)  # doesn't add if they are already there
if num_added_tokens > 0:
    gpt2_model.resize_token_embeddings(new_num_tokens=orig_num_tokens + num_added_tokens)
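As for the optimizer question: you don't need to register the 6 new rows separately. resize_token_embeddings replaces the embedding module with a larger one, so the new rows are part of the (single) embedding weight tensor and show up in model.parameters(). The one thing to watch is ordering: build the optimizer after resizing, because an optimizer created earlier would still reference the old, smaller tensor. A minimal sketch (the token names and learning rate here are just illustrative assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

# Add the special tokens and resize the embeddings to match.
num_added = tokenizer.add_special_tokens(
    {'additional_special_tokens': ['SPEC1', 'SPEC2']})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# Create the optimizer AFTER resizing: resize_token_embeddings swaps in a
# new, larger embedding matrix, and an optimizer built beforehand would
# still hold a reference to the old tensor.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# The resized embedding (including the rows for the new tokens) is now
# among the tensors the optimizer will update.
emb = model.get_input_embeddings().weight
in_optimizer = any(emb is p for g in optimizer.param_groups
                   for p in g['params'])
print(in_optimizer)
```

If you had already created the optimizer before resizing, the simplest fix is to recreate it from model.parameters() after the resize.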
Amir