
Suppose I have the following text

aim = 'Hello world! you are a wonderful place to be in.'

I want to use GPT2 to produce the input_ids, then produce the embeddings, and then recover the input_ids from the embeddings. To do this I do:

import torch
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

The input_ids can be defined as:

input_ids = tokenizer(aim)['input_ids']
#output: [15496, 995, 0, 345, 389, 257, 7932, 1295, 284, 307, 287, 13]

I can decode this to make sure it reproduces the aim:

tokenizer.decode(input_ids)
#output: 'Hello world! you are a wonderful place to be in.'

as expected! To produce the embeddings I convert the input_ids to a tensor:

input_ids_tensor = torch.tensor([input_ids])

I can then produce my embeddings as:

# Generate the embeddings for input IDs 
with torch.no_grad():
    model_output = model(input_ids_tensor)
    last_hidden_states = model_output.last_hidden_state
    
# Extract the embeddings for the input IDs from the last hidden layer
input_embeddings = last_hidden_states[0,1:-1,:]

Now, as mentioned earlier, the aim is to use input_embeddings to recover the input_ids, so I do:

x = torch.unsqueeze(input_embeddings, 1) # to make the dim acceptable
with torch.no_grad():
    text = model(x.long())
    decoded_text = tokenizer.decode(text[0].argmax(dim=-1).tolist())

But doing this I get:

IndexError: index out of range in self

at the line text = model(x.long()). What am I doing wrong? How can I recover the input_ids using the embeddings I produced?

Wiliam

1 Answer


You should use GPT2LMHeadModel instead of GPT2Model, because GPT2Model has no prediction head: it only returns hidden states, whereas GPT2LMHeadModel adds a language-modeling head that maps the hidden states back to vocabulary logits.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Instantiate the model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Set the input text
text = "Hello, how are you?"

# Tokenize the input text
input_ids = tokenizer.encode(text, return_tensors='pt')

# Use the model's forward function to obtain logits
logits = model(input_ids).logits

# Obtain the predicted token IDs by getting the argmax of the logits along the token dimension
predicted_token_ids = torch.argmax(logits, dim=-1)

# Decode the predicted token IDs back to text
output_text = tokenizer.decode(predicted_token_ids[0], skip_special_tokens=True)

# Print the output text and token IDs
print("Output text: ", output_text)
print("Output token IDs: ", predicted_token_ids.tolist())

Output:

Output text:  , I about you doing

Output token IDs:  [[11, 314, 546, 345, 1804, 198]]

The output text seems weird because the model only predicts the next token at step t given the tokens from step 1 to step t-1. For example,

Hello => ,
Hello, => I
Hello, how => about
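
To make that shift-by-one alignment concrete, here is a quick sanity check (a sketch reusing input_ids and predicted_token_ids from the snippet above): the prediction at position t should line up with the actual token at position t+1.

# The logit at position t predicts the token at position t+1, so compare the
# predictions shifted by one against the original tokens.
ids = input_ids[0].tolist()
preds = predicted_token_ids[0].tolist()

for t in range(len(ids) - 1):
    context = tokenizer.decode(ids[: t + 1])
    predicted = tokenizer.decode([preds[t]])
    actual = tokenizer.decode([ids[t + 1]])
    print(f"{context!r} => predicted {predicted!r} (actual next token: {actual!r})")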

To generate text step by step, you should use the generate function: https://huggingface.co/docs/transformers/main_classes/text_generation
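
For example, a minimal sketch (reusing the model, tokenizer, and input_ids from above; max_new_tokens=20 and greedy decoding are just illustrative choices):

# Greedy generation: generate() appends one predicted token at a time,
# feeding each prediction back in as input.
with torch.no_grad():
    generated_ids = model.generate(
        input_ids,
        max_new_tokens=20,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,  # GPT2 has no pad token; reuse EOS to avoid the warning
    )

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))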

joe32140
  • But here you are passing input_ids into the model. What I am trying to do is to recover input_ids from the input_embeddings, and it seems GPT2Model does take a parameter called inputs_embeds: https://huggingface.co/docs/transformers/model_doc/gpt2 – Wiliam Feb 23 '23 at 23:28
  • That means you can provide inputs_embeds as input, where the embeddings actually come from model.wte(input_ids). There's no easy way to convert an input embedding back to an input id. The closest thing I can think of is to compute a similarity score between an input embedding and each embedding in the model's embedding table, then take the one with the highest similarity as the id you want. – joe32140 Feb 24 '23 at 01:53
  • See here for the model details of how input ids are converted to input embeddings inside the forward function: https://github.com/huggingface/transformers/blob/633062639bfd6be15abc072aaf7e18bce355f426/src/transformers/models/gpt2/modeling_gpt2.py#L841 – joe32140 Feb 24 '23 at 01:54
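
Following up on the comments above, here is a minimal sketch of the nearest-embedding idea. It assumes the embeddings were taken directly from GPT2Model's wte embedding table (i.e. model.wte(input_ids)) rather than from last_hidden_state, so an exact match exists; the Euclidean distance and variable names are just illustrative choices.

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

text = "Hello world! you are a wonderful place to be in."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Input embeddings looked up directly from the embedding table (wte)
    input_embeddings = model.wte(input_ids)        # shape (1, seq_len, 768)

    # Recover the ids by finding, for each embedding, the closest row of wte
    embedding_table = model.wte.weight             # shape (vocab_size, 768)
    distances = torch.cdist(input_embeddings[0], embedding_table)  # (seq_len, vocab_size)
    recovered_ids = distances.argmin(dim=-1)

print(tokenizer.decode(recovered_ids.tolist()))
# Prints the original text, because each input embedding exactly matches one row of wte.
# If the embeddings come from last_hidden_state instead, there is no such guarantee.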