
I'm fairly new to machine learning, and am trying to figure out the Hugging Face Trainer API and their Transformers library. My end use case is to fine-tune a model like GODEL (or anything better than DialoGPT, really, which I managed to get working already by copy-pasting someone else's custom training loop) on a custom dataset, which I think can be accomplished with the Trainer API (please correct me if I'm wrong). But before that, I figured I'd try to get a basic toy example working by fine-tuning GPT-2 on a Hugging Face dataset.

However, modifying the tutorial code (which fine-tunes BERT for text classification, link here) to instead generate text leads to the following error when trainer.train() is called: ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask.
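
From reading the docs, I think this means the Trainer only gets a loss back when the batch contains a labels key, which my tokenized dataset doesn't have. This quick check (just my own guess at what's going on, so it might be off) seems consistent with that:

from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tok("a quick sanity check", return_tensors="pt")

# Without labels, the output only has logits and past_key_values, like in the error
print(lm(**batch).loss)  # None

# With labels (the model shifts them internally), a loss shows up
print(lm(**batch, labels=batch["input_ids"]).loss)  # scalar tensor, if I understand correctly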

Here is my complete Python code. I'm not sure if the problem is caused by not setting the metric correctly, by not passing the right arguments to Trainer(), by compute_metrics, or by something else entirely.

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer
import torch
from pynvml import *


training_args = TrainingArguments(output_dir='test_trainer', 
                                  evaluation_strategy='epoch',
                                  per_device_train_batch_size=1,
                                  per_device_eval_batch_size=1,
                                  gradient_accumulation_steps=20, # I'm paranoid about memory
                                  num_train_epochs = 2,
                                  fp16=False,)

# Load the GPT-2 causal language model (on the GPU) and its tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
metric = evaluate.load("accuracy")


# GPT-2 has no pad token by default, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

# This might be where the problem is, but I'm not sure how to write it for straight-up simple text generation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1) # Get largest logit / the prediction
    return metric.compute(predictions=predictions, references=labels)


def tokenize_function(examples):
    return tokenizer(examples["text"],
                     # padding="max_length",
                     padding=True,
                     truncation=True)


# Load a smaller, sampled dataset instead of the full one so that it doesn't take 10 years to run this code each time
dataset = load_dataset('codyburker/yelp_review_sampled')

dataset = dataset.map(tokenize_function,batched=True,
                        batch_size=1)

# Debug print statements to see what each object looks like
# print(dataset["train"][100])
# print(dataset["test"][100])

small_train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))#["input_ids"]#.map(tokenize_function,batched=True)
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(1000))#["input_ids"]#.map(tokenize_function,batched=True)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    # eval_dataset=small_eval_dataset,
    # compute_metrics=compute_metrics,  # commented out because compute_metrics is unchanged from the original text classification code
)

trainer.train()
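
One thing I was planning to try next (based on other examples I've seen, so I'm not sure it's the right fix) is giving the Trainer a data collator that copies the input IDs into labels, roughly like this:

from transformers import DataCollatorForLanguageModeling

# mlm=False should make the collator copy input_ids into labels (padding becomes -100),
# which I think is what the Trainer needs to compute a loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    data_collator=data_collator,
)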

Any help is appreciated. I'm 90% sure I'm missing a basic step in using the API, since I'm very new to this.

  • Check out https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B (the `Train` button on the right should give you some template code) – alvas Mar 13 '23 at 00:44
