The following code is without batching:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
context=torch.tensor([tokenizer.encode("This is")])
output, past = model(context)                 # logits and cached attention key/values
token = torch.argmax(output[..., -1, :])      # greedy next-token prediction
print(tokenizer.decode(token.item()))
output: ' a'
This works fine. Now, I extended this to the batch setting:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.nn.utils.rnn import pad_sequence
import torch
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
context=[torch.tensor(tokenizer.encode("This is")),torch.tensor(tokenizer.encode("Hello How are"))]
context=pad_sequence(context,batch_first=True)   # pads the first (2-token) sequence with a trailing 0
mask=torch.tensor([[1,1,0],[1,1,1]])             # 0 marks the padded position
output, past = model(context,attention_mask=mask)
token = torch.argmax(output[..., -1, :],dim=1)
tokenizer.decode(token)
output: '\n you'
Here '\n' is the next token for the first context and 'you' is the next token for the second context of the batch. But the expected next token for the first context is 'a', since all the other settings are the same. Furthermore, if you reduce the second context to 2 tokens, you do get 'a' in this batch setting. So clearly, the model cannot handle the padding.
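For reference, a minimal sketch of that equal-length check (assuming the same gpt2 model, tokenizer, and transformers version as in the snippets above); with two contexts of the same length there is no padding at all:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# two contexts of equal length (2 tokens each), so no padding is needed
context = torch.stack([torch.tensor(tokenizer.encode("This is")),
                       torch.tensor(tokenizer.encode("Hello How"))])
output, past = model(context)                    # logits, shape [batch, seq, vocab]
tokens = torch.argmax(output[:, -1, :], dim=-1)  # greedy next token per sequence
print([tokenizer.decode([t]) for t in tokens.tolist()])  # first entry is ' a' again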
Also, the attention mask does not work. After padding, the last token of the sequence 'This is' is 0 (zero), and according to the attention mask ([1,1,0]) this zero should be ignored, so that only the tokens 'This' and 'is' are attended to. The proofs that this attention masking is not working (a sketch of the check follows the list):
- Use the attention mask [1,1,1], i.e. attend even to the padding zero; you get the same output, which is '\n'.
- Use the string 'This is!'. Here '!' has index zero in the vocabulary matrix, so the input ids are identical to the padded sequence. Again you get the same output, which is '\n'.
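Here is a sketch of the check behind the first point (assuming the same padded batch and the same transformers version as in the snippet above); the only thing that changes between the two runs is the attention mask:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.nn.utils.rnn import pad_sequence
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = pad_sequence([torch.tensor(tokenizer.encode("This is")),         # 2 tokens, padded with a 0
                        torch.tensor(tokenizer.encode("Hello How are"))],  # 3 tokens
                       batch_first=True)

# run once with the padding position masked out, once with it attended to
for mask in (torch.tensor([[1,1,0],[1,1,1]]), torch.tensor([[1,1,1],[1,1,1]])):
    output, past = model(context, attention_mask=mask)
    token = torch.argmax(output[0, -1, :])
    print(tokenizer.decode([token.item()]))  # '\n' in both runs, the mask changes nothing here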
The only time it is possible to get the desired output is without the batch setting and without the attention mask (which now seems not to matter anyway, since it has no effect).
Then I found this, which suggested using a pad_token. So I used it like the following:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
from torch.nn.utils.rnn import pad_sequence
tokenizer = GPT2Tokenizer.from_pretrained("gpt2",pad_token="<PAD>")
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
context=[torch.tensor(tokenizer.encode("This is <PAD> ")),torch.tensor(tokenizer.encode("Hello How are"))]
context=torch.stack(context)
print(context)
mask=torch.tensor([[1,1,0],[1,1,1]])   # 0 marks the <PAD> position
output, past = model(context,attention_mask=mask)
token = torch.argmax(output[..., -1, :],dim=1)
tokenizer.decode(token)
output: 'The you'
Here 'The' is the next token for the first context and 'you' is the next token for the second context of the batch. This is also not working, because 'The' is not the expected next token for the first context.
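One thing I am not sure about (this is only my guess) is whether the newly added <PAD> token even has a trained embedding in the pretrained model; a quick check would be to compare its id against the size of the embedding matrix:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", pad_token="<PAD>")
model = GPT2LMHeadModel.from_pretrained("gpt2")

print(tokenizer.pad_token_id)                        # id assigned to <PAD>
print(model.get_input_embeddings().weight.shape[0])  # number of pretrained embeddings
# if the pad id is not below the embedding count, the pretrained model has no
# embedding for it and would presumably need model.resize_token_embeddings(len(tokenizer))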
How do I use variable-length sequences in a batch setting with the GPT/GPT-2 models?
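For reference, here is a sketch of what I suspect might be needed (this is my own guess, not something I found in the docs): pass explicit position_ids so the padded positions do not shift anything, and read the logits at the last non-padded position of each sequence instead of at the final position.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.nn.utils.rnn import pad_sequence
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sequences = [torch.tensor(tokenizer.encode("This is")),
             torch.tensor(tokenizer.encode("Hello How are"))]
lengths = torch.tensor([len(s) for s in sequences])   # [2, 3]
context = pad_sequence(sequences, batch_first=True)   # right-padded with 0s
mask = (torch.arange(context.shape[1])[None, :] < lengths[:, None]).long()

# position ids that do not advance over padded positions (my assumption)
position_ids = (mask.cumsum(dim=1) - 1).clamp(min=0)

output, past = model(context, attention_mask=mask, position_ids=position_ids)

# take the prediction at the last *real* token of each sequence,
# not at the final (possibly padded) position
last_logits = output[torch.arange(len(sequences)), lengths - 1, :]
tokens = torch.argmax(last_logits, dim=-1)
print([tokenizer.decode([t]) for t in tokens.tolist()])  # expecting ' a' for the first context

Is something along these lines the intended way to handle variable-length batches, or is there a proper way to do this?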