fine tune causal language model using transformers and pytorch

Question

I have some questions about fine-tuning a causal language model using transformers and PyTorch.

My main goal is to fine-tune XLNet. However, most of the posts I found online target text classification, like this post. Is there any way to fine-tune the model without using run_language_model.py from the transformers GitHub repository?

Here is a piece of my code trying to fine-tune XLNet:

import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased", do_lower_case=True)
LOSS = torch.nn.CrossEntropyLoss()
batch_texts = ["this is sentence 1", "i have another sentence like this", "the final sentence"]
# call the tokenizer on the whole batch; padding makes the rows equal length
encodings = tokenizer(batch_texts, add_special_tokens=True, padding=True,
                      return_tensors="pt", return_attention_mask=True)
outputs = model(encodings["input_ids"], attention_mask=encodings["attention_mask"])
loss = LOSS(outputs[0], target_ids)  # target_ids is undefined: this is where I am stuck
loss.backward()
# ignoring the rest of the code...

I got stuck at the last two lines. First, with this LM model I don't seem to have any labels, as supervised learning usually does. Second, since a language model is trained to minimize a loss (cross-entropy here), I need target_ids to compute the loss and perplexity against input_ids.

Here are my follow-up questions:

How should I deal with the labels during model fitting?
Should I set something like target_ids=encodings["input_ids"].copy() to compute the cross-entropy loss and perplexity?
If not, how should I set target_ids?
Looking at the perplexity page in the transformers documentation, how should I adapt its method to input text of variable length?
I saw another post from the documentation saying that causal language modeling requires padding the text. However, the link in 3) shows no sign of padding. Which one should I follow?

Any suggestions and advice will be appreciated!

Please define at first what you want to achieve. Finetuning is a meaningless term when you don't have a target because you finetune a model in a certain direction. In other words what kind of output do you expect from a tuned model when you give it a certain input. — cronoik, Aug 26 '20 at 22:37
@cronoik My goal is to fine tune the model to minimize the perplexity of input text — BigD, Aug 27 '20 at 00:04
What does that mean? Do you want to classify or do you want to summarize your text. Please give me an example (add this directly to your question). — cronoik, Aug 27 '20 at 00:06
@cronoik I was trying to fine-tune the causal language model. Technically speaking, the targets for this task is the ```input_ids``` from the tokenizer, not binary labels like 0 or 1. — BigD, Sep 15 '20 at 21:54
In case you are looking for my help, it would be great if you could simply answer my question. I haven't asked you anything technical, but I asked you to clarify the overall objective. I can read your code by myself. So just for a moment imagine that your model is working as expected, what is the output of your model for an example input (please add the example input and expected output directly to your question). — cronoik, Sep 16 '20 at 16:37

score 1 · Answer 1 · answered Jan 16 '22 at 15:04

When fine-tuning a model with a language-model head, the labels are the next tokens themselves: at each position the model predicts the token that follows. Hugging Face's library makes many things easy by hiding most of the complexity inside its methods, which is very nice when you want to do something standard. But if you want to do something special, or if you want to learn and understand the details, I suggest implementing the training loop directly in PyTorch; coding the low-level parts is the best way to learn.
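To make "the labels are the next tokens" concrete, here is a minimal sketch in plain PyTorch. Random logits stand in for a real model, and the helper name next_token_loss is made up for illustration. As far as I know, GPT-2 style heads such as GPT2LMHeadModel apply this same shift internally when you pass labels=input_ids, so with that API you would not shift by hand.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    """Cross-entropy between the logits at position t and the token at t+1.

    logits:    (batch, seq_len, vocab_size) float tensor
    input_ids: (batch, seq_len) long tensor
    (next_token_loss is a hypothetical helper, not a library function.)
    """
    # The logits at the last position have no target, so drop them;
    # the first token is never predicted, so drop it from the targets.
    shifted_logits = logits[:, :-1, :]
    targets = input_ids[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )

# Stand-in for model output: random logits over a 10-token vocabulary.
torch.manual_seed(0)
input_ids = torch.randint(0, 10, (2, 5))
logits = torch.randn(2, 5, 10)
loss = next_token_loss(logits, input_ids)
print(loss.item())
```

So the targets are just input_ids moved one position to the left; no separate label file is needed.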

For this case, here is a draft to get started. The training loop is far from complete, and it must be adapted to each specific case anyway, so I hope these few lines help you get started.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
# our input:
s = tokenizer.encode('In winter, the weather is', return_tensors='pt')
# we want to fine-tune to force a fake output as follows:
ss = tokenizer.encode('warm and hot', return_tensors='pt')
# forward pass:
outputs = model(s)
# check that the output logits are given for every input token:
print(outputs.logits.size())
# we're going to train on the token that follows the last input one,
# so we extract just the last logit:
lasty = outputs.logits[0, -1].view(1, -1)
# prepare backprop (torch.optim.AdamW replaces the deprecated transformers.AdamW):
lossfct = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# just take the first next token (you should repeat this for the next ones)
labels = ss[0][0].view(1)
loss = lossfct(lasty, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

# fine-tuning step done: you may check that the answer is already different:
y = model.generate(s)
sy = tokenizer.decode(y[0])
print(sy)
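The loop above trains on a single next token. For the variable-length part of the question, one common approach (a sketch, not the only option) is to pad the batch, set the targets at padded positions to -100 so that the cross-entropy loss skips them (-100 is its default ignore_index), and take the perplexity as exp of the mean loss over the real tokens. Random logits stand in for the model here; the helper masked_lm_loss, the pad id 0, and right-side padding are all assumptions made for the example.

```python
import math
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, input_ids, attention_mask):
    """Mean next-token cross-entropy over non-padding targets only.

    Hypothetical helper; assumes the batch is padded on the right.
    """
    targets = input_ids[:, 1:].clone()
    # A target counts only if that position holds a real token.
    targets[attention_mask[:, 1:] == 0] = -100  # default ignore_index
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )

torch.manual_seed(0)
vocab_size = 10
# Two sequences of length 5 and 3, right-padded with id 0 (assumed pad id).
input_ids = torch.tensor([[4, 7, 1, 3, 9],
                          [2, 5, 8, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])
logits = torch.randn(2, 5, vocab_size)  # stand-in for model(...).logits

loss = masked_lm_loss(logits, input_ids, attention_mask)
perplexity = math.exp(loss.item())
print(loss.item(), perplexity)
```

With a real model you would replace the random logits with the model's output for the padded batch and call loss.backward() as in the draft above.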
