
I am developing code that uses the pre-trained GPT2 model for a machine translation task. My data's word-to-id dictionary has length 91, and I wrote the following code for my model:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers.models.gpt2.modeling_gpt2 import GPT2Model

# data preparation code

def batch_sequences(x, y, env):
    """
    Take as input a list of n sequences (torch.LongTensor vectors) and return
    a tensor of size (slen, n) where slen is the length of the longest
    sentence, and a vector lengths containing the length of each sentence.
    """
    lengths_x = torch.LongTensor([len(s) + 2 for s in x])
    lengths_y = torch.LongTensor([len(s) + 2 for s in y])
    max_length = max(lengths_x.max().item(), lengths_y.max().item())
    sent_x = torch.LongTensor(
        max_length, lengths_x.size(0)).fill_(env.pad_index)
    sent_y = torch.LongTensor(
        max_length, lengths_y.size(0)).fill_(env.pad_index)
    assert lengths_x.min().item() > 2
    assert lengths_y.min().item() > 2

    sent_x[0] = env.eos_index
    for i, s in enumerate(x):
        sent_x[1:lengths_x[i] - 1, i].copy_(s)
        sent_x[lengths_x[i] - 1, i] = env.eos_index

    sent_y[0] = env.eos_index
    for i, s in enumerate(y):
        sent_y[1:lengths_y[i] - 1, i].copy_(s)
        sent_y[lengths_y[i] - 1, i] = env.eos_index

    return sent_x, sent_y, max_length

def collate_fn(elements):
    """
    Collate samples into a batch.
    """
    x, y = zip(*elements)
    x = [torch.LongTensor([env.word2id[w]
                          for w in seq if w in env.word2id]) for seq in x]
    y = [torch.LongTensor([env.word2id[w]
                          for w in seq if w in env.word2id]) for seq in y]
    x, y, length = batch_sequences(x, y, env)
    return (x, length), (y, length)

loader = DataLoader(data, batch_size=1, shuffle=False, collate_fn=collate_fn)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
gpt2 = GPT2Model.from_pretrained('gpt2').to(device)
in_layer = nn.Embedding(len(env.word2id), 768).to(device)
out_layer = nn.Linear(768, len(env.word2id)).to(device)

parameters = list(gpt2.parameters()) + list(in_layer.parameters()) + list(out_layer.parameters())
optimizer = torch.optim.Adam(parameters)
loss_fn = nn.CrossEntropyLoss()
for layer in (gpt2, in_layer, out_layer):
    layer.train()

accuracies = list()
n_epochs = 5
for i in range(n_epochs):
    for (x, x_len), (y, y_len) in loader:

        x = x.to(device=device)
        y = y.to(device=device)

        embeddings = in_layer(x.reshape(1, -1))
        hidden_state = gpt2(inputs_embeds=embeddings).last_hidden_state[:, :]
        logits = out_layer(hidden_state)[0]
        loss = loss_fn(logits, y.reshape(-1))
        accuracies.append(
            (logits.argmax(dim=-1) == y.reshape(-1)).float().mean().item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if len(accuracies) % 500 == 0:
            accuracy = sum(accuracies[-50:]) / len(accuracies[-50:])
            print(f'Samples: {len(accuracies)}, Accuracy: {accuracy}')

This code works pretty well when the batch size is 1, but it is very slow. I wanted to increase the batch size from 1 to 32, but I get dimension compatibility problems. How can I increase the batch size without errors?

My data consists of pairs of sentences: the first is a sentence in the first language and the second is its translation into the second language.

For example, assume that x.shape is (batch_size, 12), meaning we have batch_size sentences of length 12 as input, and y.shape is also (batch_size, 12) (the translations). We also have a word-to-id dictionary of length 90 that maps each word in a sentence to its index.

K.N
  • Could you please share the error you got? It will be handy for someone with the same problem searching for this question. – Jindřich May 11 '21 at 07:08
  • @Jindřich The above code works correctly, because the batch size is one (look at the part 'embeddings = in_layer(x.reshape(1, -1))' in the code). My problem is I cannot increase the batch size from 1 to something more than that (like 32) and I get shape compatibility errors which show that I am not implementing it correctly. I added a paragraph at the end of my question above, about my desired shapes. – K.N May 11 '21 at 07:28

1 Answer


This problem can be solved using padding. We need two special symbols:

  • code 0 in the inputs (x) will denote "blank" tokens that should not be translated.
  • code -100 in the outputs (y) will denote "blank" tokens that should not participate in the loss calculation. nn.CrossEntropyLoss() ignores this value by default (its ignore_index argument defaults to -100).

A batch of size 3 could look like this:

x:
[[ 1,  2,    3,    0,    0],
 [ 4,  5,    6,    7,    8],
 [ 9,  8,    0,    0,    0]]
y:
[[ 1,  2,    3, -100, -100],
 [ 4,  5,    6,    7,    8],
 [ 9,  8, -100, -100, -100]]

You could generate it with code such as:

def pad_sequences(batch, pad_value=0):
    # Right-pad each sequence (a Python list) to the length of the longest one.
    n = max(len(v) for v in batch)
    return torch.tensor([v + [pad_value] * (n - len(v)) for v in batch])
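
With batches padded this way, the training step from the question no longer needs the reshape(1, -1) trick. Below is a minimal sketch of one batched step; it assumes x and y come out of the collate function with shape (batch_size, seq_len), with x padded by 0 and y padded by -100, and it reuses the in_layer, gpt2, out_layer, optimizer and device names from the question:

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # positions where y == -100 are skipped

for x, y in loader:
    x = x.to(device)                              # (batch_size, seq_len)
    y = y.to(device)                              # (batch_size, seq_len)

    embeddings = in_layer(x)                      # (batch_size, seq_len, 768)
    hidden_state = gpt2(inputs_embeds=embeddings).last_hidden_state
    logits = out_layer(hidden_state)              # (batch_size, seq_len, vocab_size)

    # nn.CrossEntropyLoss expects (N, C) logits and (N,) targets,
    # so fold the batch and time dimensions together.
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The accuracy computation would likewise need to mask out the positions where y equals -100 before averaging.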

However, I feel there is an issue with your problem statement. If you perform machine translation, your inputs and outputs can have different lengths, but your architecture only allows x and y to have the same length. If you want to support x and y of different lengths, I would suggest using a seq2seq architecture such as T5 instead.
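
If you go down that road, here is a minimal sketch of fine-tuning a pretrained T5 on sentence pairs; the model name, example sentences, and variable names are only illustrative:

from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

src = ["a source sentence", "another source sentence"]    # illustrative data
tgt = ["its translation", "another translation"]

enc = tokenizer(src, padding=True, return_tensors='pt')
labels = tokenizer(tgt, padding=True, return_tensors='pt').input_ids
labels[labels == tokenizer.pad_token_id] = -100            # ignore padding in the loss

# When labels are given, the model computes the cross-entropy loss itself.
outputs = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask,
                labels=labels)
outputs.loss.backward()

Because the encoder and decoder are separate, x and y do not need to be aligned or to have the same length.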

Another issue is that GPT is autoregressive, so if y is completely aligned with x, we cannot use the suffix of x while generating the left part of y. If you want x and y to be perfectly aligned but still use the full information about x when generating y, I would recommend a bidirectional encoder such as BERT instead.
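
For that aligned setup, the change relative to the code in the question would be small. A sketch, assuming the same in_layer and out_layer and a batched x of shape (batch_size, seq_len); BertModel also accepts inputs_embeds, and its base variant has hidden size 768:

from transformers import BertModel

bert = BertModel.from_pretrained('bert-base-uncased').to(device)

embeddings = in_layer(x)                          # (batch_size, seq_len, 768)
hidden_state = bert(inputs_embeds=embeddings).last_hidden_state
logits = out_layer(hidden_state)                  # (batch_size, seq_len, vocab_size)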

David Dale
  • Thanks for your answer. I know how to create the data with batches. My problem is in the last part, where I train the model. Can you tell me how I should change that part so that it can run for every batch size? – K.N May 14 '21 at 13:05
  • Thank you David for the answer. But can you please answer this question too? https://stackoverflow.com/questions/76376524/expected-input-batch-size-28-to-match-target-batch-size-456-changing-batch – Irfan Yaqub Jun 25 '23 at 06:40