
I am using T5 to summarize multiple sequences as a batch. I want to reproduce the output of model.generate(input_ids) by calling the forward function (model(**inputs)). I know that forward() and generate() work completely differently, see this. To make them behave the same way, I take some sequences, call model.generate() on them to produce the corresponding summaries, and so obtain pairs of (text, summary). Calling the forward function on these pairs one at a time reproduces the same outputs. However, when I call the forward function on a batch of sequences, the output is not the same. What did I miss?

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))
model.to("cuda")
model.eval()

# sequences
seq1 = "summarize: Calling the model (which means the forward method) uses the labels for teacher forcing. This means inputs to the decoder are the labels shifted by one"
output1 = "calling the model uses the labels for teacher forcing. inputs to the decoder"

seq2 = "summarize: When you call the generate method, the model is used in the autoregressive fashion"
output2 = "the model is used in the auto-aggressive fashion."

seq3 = "summarize: However, selecting the token is a hard decision, and the gradient cannot be propagated through this decision"
output3 = "the token is a hard decision, and the gradient cannot be propagated through this decision"

input_sequences = [seq1, seq2, seq3]
output_seq = [output1, output2, output3]
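
# For reference: the hard-coded summaries above were obtained with model.generate().
# A minimal sketch of that step (greedy decoding with the default generation
# settings is an assumption here; gen_enc / generated_summaries are my own names):
gen_enc = tokenizer(input_sequences, padding="longest", return_tensors="pt").to("cuda")
gen_ids = model.generate(input_ids=gen_enc.input_ids, attention_mask=gen_enc.attention_mask)
generated_summaries = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
# generated_summaries is where the strings in output_seq come from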

# encoding input and attention mask
encoding = tokenizer(
    input_sequences,
    padding="longest",
    max_length=128,
    truncation=True,
    return_tensors="pt",
)

input_ids, attention_mask = encoding.input_ids.to("cuda"), encoding.attention_mask.to("cuda")

# labels
target_encoding = tokenizer(
    output_seq, padding="longest", max_length=128, truncation=True
)
labels = target_encoding.input_ids
labels = torch.tensor(labels).to("cuda")
labels[labels == tokenizer.pad_token_id] = -100
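# (-100 is the default ignore_index of the cross-entropy loss, so the padded
#  label positions above are excluded from the loss)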

# Call the models
logits = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).logits

# Apply softmax(), argmax() and batch_decode()

X = logits
X = F.softmax(X, dim=-1)
ids = X.argmax(dim=-1)
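# (note: taking argmax after the softmax picks the same ids as argmax on the raw logits)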
y = tokenizer.batch_decode(sequences=ids, skip_special_tokens=True)

# results: batch_size=3

['call the model uses the labels for teacher forcing  inputs to the decoder are',
 'the model is used in the auto-aggressive fashion  the the the',
 'the token is a hard decision, and the gradient cannot be propagated through this decision ']

# results: batch_size =1 i.e. consider 1 seq each time

['call the model uses the labels for teacher forcing  inputs to the decoder are']

['the model is used in the auto-aggressive fashion ']

['the token is a hard decision, and the gradient cannot be propagated through this decision ']
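
For completeness, the batch_size=1 results above come from running the same teacher-forced forward pass on one (text, summary) pair at a time. A minimal sketch of that loop (the loop variable names and the no_grad wrapper are mine):

for seq, out in zip(input_sequences, output_seq):
    enc = tokenizer(seq, max_length=128, truncation=True, return_tensors="pt").to("cuda")
    lab = tokenizer(out, max_length=128, truncation=True, return_tensors="pt").input_ids.to("cuda")
    with torch.no_grad():
        single_logits = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=lab).logits
    print(tokenizer.batch_decode(single_logits.argmax(dim=-1), skip_special_tokens=True))
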
LearnToGrow
  • When you feed only one sequence as input, the transformer performs attention on that sequence only. But when we input a batch of sequences (with uneven lengths), attention is also performed over the padding tokens, which could be the reason behind the extra `the` in the second sequence – Karan Dhingra May 09 '22 at 21:29
  • @KaranDhingra, thanks, but even with 4 sequences it is still the same problem – LearnToGrow May 10 '22 at 00:50
  • It does not depend on the number of input sequences but on the padding present. Please try it again but truncate the inputs to the length of the second sequence; then your output should match the one you get when you input only the second sequence – Karan Dhingra May 10 '22 at 04:06
  • @KaranDhingra, normally this should be handled by the attention mask, because we pad the short sequences and mask the pad tokens. Following your reasoning, we would have to train with equal-length sequences! – LearnToGrow May 10 '22 at 04:26
  • No no, you are right. I completely forgot about the attention mask. Lemme try replicating this with a pretrained transformer, because the output should be the same – Karan Dhingra May 10 '22 at 09:42

0 Answers