The following code is without batching:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
context=torch.tensor([tokenizer.encode("This is")])
output, past = model(context)                 # logits and cached attention key/values
token = torch.argmax(output[..., -1, :])      # greedy next-token prediction
print(tokenizer.decode(token.item()))
output: ' a'
This works fine. Now, I extended this to the batch setting:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.nn.utils.rnn import pad_sequence
import torch
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
context=[torch.tensor(tokenizer.encode("This is")),torch.tensor(tokenizer.encode("Hello How are"))]
context=pad_sequence(context,batch_first=True)   # pads the first (2-token) sequence with a trailing 0
mask=torch.tensor([[1,1,0],[1,1,1]])             # 0 marks the padded position
output, past = model(context,attention_mask=mask)
token = torch.argmax(output[..., -1, :],dim=1)
tokenizer.decode(token)
output: '\n you'
Here '\n' is the next token for the first context and 'you' is the next token for the second context of the batch. But the expected next token for the first context is 'a', since all the other settings are the same. Furthermore, if you reduce the second context to 2 tokens, you do get 'a' in this batch setting. So clearly, the model cannot handle the padding.
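For reference, a minimal sketch of that equal-length check (assuming the same gpt2 model, tokenizer, and transformers version as in the snippets above); with two contexts of the same length there is no padding at all:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# two contexts of equal length (2 tokens each), so no padding is needed
context = torch.stack([torch.tensor(tokenizer.encode("This is")),
                       torch.tensor(tokenizer.encode("Hello How"))])
output, past = model(context)                    # logits, shape [batch, seq, vocab]
tokens = torch.argmax(output[:, -1, :], dim=-1)  # greedy next token per sequence
print([tokenizer.decode([t]) for t in tokens.tolist()])  # first entry is ' a' again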
Also, the attention mask does not work. After padding, the last token of the sequence 'This is' is 0 (zero), and according to the attention mask ([1,1,0]) this zero should be ignored, so that only the tokens 'This' and 'is' are attended to. The proofs that this attention masking is not working (a sketch of the check follows the list):
- Use the attention mask [1,1,1], i.e. attend even to the padding zero; you get the same output, which is '\n'.
- Use the string 'This is!'. Here '!' has index zero in the vocabulary matrix, so the input ids are identical to the padded sequence. Again you get the same output, which is '\n'.
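Here is a sketch of the check behind the first point (assuming the same padded batch and the same transformers version as in the snippet above); the only thing that changes between the two runs is the attention mask:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.nn.utils.rnn import pad_sequence
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = pad_sequence([torch.tensor(tokenizer.encode("This is")),         # 2 tokens, padded with a 0
                        torch.tensor(tokenizer.encode("Hello How are"))],  # 3 tokens
                       batch_first=True)

# run once with the padding position masked out, once with it attended to
for mask in (torch.tensor([[1,1,0],[1,1,1]]), torch.tensor([[1,1,1],[1,1,1]])):
    output, past = model(context, attention_mask=mask)
    token = torch.argmax(output[0, -1, :])
    print(tokenizer.decode([token.item()]))  # '\n' in both runs, the mask changes nothing here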
The only time it is possible to get the desired output is without the batch setting and without the attention mask (which now seems not to matter anyway, since it has no effect).
Then I found this, which suggested using a pad_token. So I used it like the following:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
from torch.nn.utils.rnn import pad_sequence
tokenizer = GPT2Tokenizer.from_pretrained("gpt2",pad_token="<PAD>")
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
context=[torch.tensor(tokenizer.encode("This is <PAD> ")),torch.tensor(tokenizer.encode("Hello How are"))]
context=torch.stack(context)
print(context)
mask=torch.tensor([[1,1,0],[1,1,1]])   # 0 marks the <PAD> position
output, past = model(context,attention_mask=mask)
token = torch.argmax(output[..., -1, :],dim=1)
tokenizer.decode(token)
output: 'The you'
Here 'The' is the next token for the first context and 'you' is the next token for the second context of the batch. This is also not working, because 'The' is not the expected next token for the first context.
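One thing I am not sure about (this is only my guess) is whether the newly added <PAD> token even has a trained embedding in the pretrained model; a quick check would be to compare its id against the size of the embedding matrix:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", pad_token="<PAD>")
model = GPT2LMHeadModel.from_pretrained("gpt2")

print(tokenizer.pad_token_id)                        # id assigned to <PAD>
print(model.get_input_embeddings().weight.shape[0])  # number of pretrained embeddings
# if the pad id is not below the embedding count, the pretrained model has no
# embedding for it and would presumably need model.resize_token_embeddings(len(tokenizer))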
How do I use variable-length sequences in a batch setting with the GPT/GPT-2 models?
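For reference, here is a sketch of what I suspect might be needed (this is my own guess, not something I found in the docs): pass explicit position_ids so the padded positions do not shift anything, and read the logits at the last non-padded position of each sequence instead of at the final position.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.nn.utils.rnn import pad_sequence
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sequences = [torch.tensor(tokenizer.encode("This is")),
             torch.tensor(tokenizer.encode("Hello How are"))]
lengths = torch.tensor([len(s) for s in sequences])   # [2, 3]
context = pad_sequence(sequences, batch_first=True)   # right-padded with 0s
mask = (torch.arange(context.shape[1])[None, :] < lengths[:, None]).long()

# position ids that do not advance over padded positions (my assumption)
position_ids = (mask.cumsum(dim=1) - 1).clamp(min=0)

output, past = model(context, attention_mask=mask, position_ids=position_ids)

# take the prediction at the last *real* token of each sequence,
# not at the final (possibly padded) position
last_logits = output[torch.arange(len(sequences)), lengths - 1, :]
tokens = torch.argmax(last_logits, dim=-1)
print([tokenizer.decode([t]) for t in tokens.tolist()])  # expecting ' a' for the first context

Is something along these lines the intended way to handle variable-length batches, or is there a proper way to do this?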