I'm working with the T5 model from the Hugging Face Transformers library. I have an input sequence with masked (sentinel) tokens that I want to replace with the output generated by the model. Here's the code:
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
# The sentinel tokens <extra_id_0>, <extra_id_1>, ... mark the masked spans
input_data = "The <extra_id_0> walks in <extra_id_1> park"
input_ids = tokenizer(input_data, return_tensors="pt").input_ids
# Generate the replacement spans and decode them back to text
sequence_ids = model.generate(input_ids)
output_sequences = tokenizer.batch_decode(sequence_ids)
output_sequences
This code produces the following output:
['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']
What I want to do is replace the masked tokens <extra_id_0> and <extra_id_1> in the input sequence with the corresponding spans generated by the model, so that the final output is:
The park offers walks in the park.
I'm hoping someone can help me with the code to achieve this.
Note the correspondence between each mask in input_data and its answer in output_sequences:
<extra_id_0> -> <extra_id_0> park offers (so we extract 'park offers' only)
<extra_id_1> -> <extra_id_1> the (so we extract 'the' only)
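For reference, here is a rough sketch of the kind of post-processing I have in mind: strip the padding/EOS tokens, split the generated text on the sentinel tokens, and substitute each extracted span back into the input. The helper name fill_masks is my own, and I'm not confident it handles edge cases (extra sentinels beyond the ones in the input, whitespace, final punctuation) properly, so I'd appreciate a more robust version:

import re

# Hypothetical helper (my own naming): split the generated text on the sentinel
# tokens and substitute each extracted span back into the original input.
def fill_masks(input_text, generated_text):
    # Drop the special tokens that wrap the generation
    cleaned = generated_text.replace("<pad>", "").replace("</s>", "").strip()
    # Splitting on <extra_id_N> gives ['', ' park offers', ' the', ' park.'];
    # the leading empty string is the text before <extra_id_0>, so skip it
    spans = re.split(r"<extra_id_\d+>", cleaned)[1:]
    result = input_text
    for i, span in enumerate(spans):
        # Only masks that actually occur in the input get replaced;
        # <extra_id_2> is not in input_data, so that span is ignored
        result = result.replace(f"<extra_id_{i}>", span.strip(), 1)
    return result

print(fill_masks(input_data, output_sequences[0]))
# -> "The park offers walks in the park" (give or take whitespace/punctuation)

Is this a reasonable approach, or is there a cleaner way to do it with the tokenizer itself?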