My input is a string and the outputs are vector representations (corresponding to the generated tokens). I'm trying to force the outputs to contain specific tokens (e.g., 4 commas, 2 occurrences of the word "to", etc.). That is, each generated sentence must contain those tokens.
Is there a loss component that could force GPT2 to generate specific tokens? (A rough sketch of what I have in mind is below, after this paragraph.) Another approach, which would be easier and more robust (but which I'm not sure is possible), is similar to token masking in BERT: instead of forcing GPT2 to generate sentences containing the specific tokens, the predefined tokens would already be placed in the sentence beforehand:
[MASK][MASK][specific_token][MASK][MASK][specific_token]
However, an issue with this approach is that there is no predefined number of tokens to generate/mask before or after each [specific_token], nor is there a predefined number of sentences to generate for each input (otherwise I would have used BERT).
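To make the loss idea concrete, here is roughly the kind of penalty I have in mind; the soft-count formulation and the placeholder token ids (comma_id, to_id) are just assumptions on my part, not something I have tried:

import torch.nn.functional as F

def required_token_loss(logits, required_counts):
    # logits: (seq_len, vocab_size) logits for the generated positions
    # required_counts: {token_id: desired_count}, e.g. {comma_id: 4, to_id: 2}
    probs = F.softmax(logits, dim=-1)
    loss = 0.0
    for token_id, target in required_counts.items():
        expected = probs[:, token_id].sum()  # differentiable "soft" count of the token
        loss = loss + (expected - target) ** 2
    return loss

The intention would be to add such a term to the usual language-modelling loss during fine-tuning, so the model is pushed towards emitting each required token the desired number of times.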
Code:
from transformers import logging
from transformers import GPT2Tokenizer, GPT2Model
import torch
checkpoint = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
model = GPT2Model.from_pretrained(checkpoint)
num_added_tokens = tokenizer.add_special_tokens({'pad_token': '[CLS]'}) # GPT2 has no padding token by default, so add one
embedding_layer = model.resize_token_embeddings(len(tokenizer)) # Update the model embeddings with the new vocabulary size
input_string = 'Architecturally, the school has a Catholic character.'
token_ids = tokenizer(input_string, truncation=True, padding=True, return_tensors='pt') # return PyTorch tensors with a batch dimension
output = model(**token_ids) # hidden states, one vector per input token
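For reference, what the model returns here (and what I refer to above as the vector representations) are the hidden states, one vector per input token:

# output.last_hidden_state has shape (batch_size, sequence_length, hidden_size),
# i.e. (1, number_of_input_tokens, 768) for the base gpt2 checkpoint
print(output.last_hidden_state.shape)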