I am trying to train a dialog system using GPT-2. For tokenization, I am adding the special tokens with the following configuration.
from transformers import (
    AdamW,
    AutoConfig,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)
SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]", "[TRIPLE]", "[SEP]", "[Q]", "[DOM]"],
}
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
tokenizer.add_special_tokens(SPECIAL_TOKENS)
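After calling add_special_tokens, the custom tokens do get fresh ids appended after GPT-2's base vocabulary of 50257 tokens, and GPT-2's unknown token is <|endoftext|> itself. Here is a quick sanity check, roughly what I ran (the printed ids are from my run):

print(len(tokenizer))                                         # 50257 base tokens + the 11 added ones
print(tokenizer.convert_tokens_to_ids(["[SUB]", "[PRED]"]))   # [50261, 50262]
print(tokenizer.unk_token, tokenizer.unk_token_id)            # <|endoftext|> 50256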
Next, when I tokenize a sequence (a dialog utterance) and later convert it into ids, some of the most important tokens in my sequence get mapped to the unknown token: their ids become identical to the bos and eos ids, since bos, eos, and unk all default to <|endoftext|> in GPT-2's tokenizer.
Here is a working example -
tokenized_sequence = ['[PRED]', 'name', '[SUB]', 'frankie_and_bennys', '[PRED]', 'address', '[SUB]', 'cambridge_leisure_park_clifton_way_cherry_hinton', '[PRED]', 'area', '[SUB]', 'south', '[PRED]', 'food', '[SUB]', 'italian', '[PRED]', 'phone', '[SUB]', '01223_412430', '[PRED]', 'pricerange', '[SUB]', 'expensive', '[PRED]', 'postcode', '[SUB]', 'cb17dy']
important_tokens = ['frankie_and_bennys','cambridge_leisure_park_clifton_way_cherry_hinton','italian','postcode', 'cb17dy']
tokens_to_ids = [50262, 3672, 50261, 50256, 50262, 21975, 50261, 50256, 50262, 20337, 50261, 35782, 50262, 19425, 50261, 50256, 50262, 4862, 50261, 50256, 50262, 50256, 50261, 22031, 50262, 50256, 50261, 50256]
ids_to_tokens = [PRED]name[SUB]<|endoftext|>[PRED]address[SUB]<|endoftext|>[PRED]area[SUB]south[PRED]food[SUB]<|endoftext|>[PRED]phone[SUB]<|endoftext|>[PRED]<|endoftext|>[SUB]expensive[PRED]<|endoftext|>[SUB]<|endoftext|>
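For reference, the two lists above were produced with the tokenizer's lookup methods, roughly like this (a sketch; note that any pre-split token missing from the vocabulary is silently mapped to unk_token_id, i.e. 50256):

tokens_to_ids = tokenizer.convert_tokens_to_ids(tokenized_sequence)
ids_to_tokens = "".join(tokenizer.convert_ids_to_tokens(tokens_to_ids))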
As you can see, the important_tokens are mapped to id 50256 (that is, to <|endoftext|>), so the model never sees or learns these important tokens and hence generates very poor and often hallucinated responses.
What could be a quick and efficient fix for this issue?