
I'm using the Hugging Face Transformers package to load a pretrained GPT-2 model. I want to use GPT-2 for text generation, but the pretrained version isn't enough, so I want to fine-tune it with a bunch of personal text data.

I'm not sure how I should prepare my data and train the model. I have tokenized the text data I want to train GPT-2 on, but I'm not sure what the "labels" should be for text generation, since this isn't a classification problem.

How do I train GPT-2 on this data using the Keras API?

My model:

modelName = "gpt2"
generator = pipeline('text-generation', model=modelName)

My tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(modelName)

My tokenized dataset:

from datasets import Dataset

def tokenize_function(examples):
    return tokenizer(examples['dataset'])  # the 'dataset' column holds the text; each row is one string (in sequence)
dataset = Dataset.from_pandas(conversation)
tokenized_dataset = dataset.map(tokenize_function, batched=False)
print(tokenized_dataset)

How should I use this tokenized dataset to fine tune my GPT-2 model?

ParmuTownley
  • The colab notebook from this blog might be helpful: https://reyfarhan.com/posts/easy-gpt2-finetuning-huggingface/ – druskacik Dec 07 '22 at 10:02
  • Hello, I am looking to fine-tune the GPT-2 model for question answering, or rather "generative question answering". Meaning, I train GPT-2 on a large corpus of data for a specific industry (say, medical) and then start asking it questions. If possible, will you please point me toward that? Thanks – Aayush Shah Mar 06 '23 at 12:18

2 Answers


This is my attempt:

"""
Datafile is a text file with one sentence per line _DATASETS/data.txt
tf_gpt2_keras_lora is the name of the fine-tuned model
"""

import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
from transformers.modeling_tf_utils import get_initializer
import os

# use 2 cores
tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)

# Resume from a previously fine-tuned model if it exists,
# otherwise start from the pretrained gpt2 checkpoint
if os.path.exists("tf_gpt2_keras_lora"):
    print("Model exists")
    # use pretrained model
    model = TFGPT2LMHeadModel.from_pretrained("tf_gpt2_keras_lora")
else:
    print("Downloading model")
    model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Load the tokenizer (the model was loaded above)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load and preprocess the data
with open("_DATASETS/data.txt", "r") as f:
    lines = f.read().split("\n")

# Encode the data using the tokenizer and truncate the sequences to a maximum length of 1024 tokens
input_ids = []
for line in lines:
    encoding = tokenizer.encode(line, add_special_tokens=True, max_length=1024, truncation=True)
    input_ids.append(encoding)

# Define some params
batch_size = 2
num_epochs = 3
learning_rate = 5e-5

# Define the optimizer and loss function
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# NOTE: despite the model name, this is not low-rank adaptation or attention
# pruning. The Dense layers assigned here are never called in the forward
# pass, so they have no effect; all of GPT-2's weights are fine-tuned below.
for layer in model.transformer.h:
    layer.attention_output_dense = tf.keras.layers.Dense(
        units=256, kernel_initializer=get_initializer(0.02), name="attention_output_dense"
    )

model.summary()

# Train the model
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    
    # Shuffle the input data
    #input_ids = tf.random.shuffle(input_ids)
    
    for i in range(0, len(input_ids), batch_size):
        batch = input_ids[i:i+batch_size]
        # Pad the batch to the same length
        batch = tf.keras.preprocessing.sequence.pad_sequences(batch, padding="post")
        # Define the inputs and targets
        inputs = batch[:, :-1]
        targets = batch[:, 1:]
        # Compute the predictions and loss
        with tf.GradientTape() as tape:
            logits = model(inputs)[0]
            loss = loss_fn(targets, logits)
        # Compute the gradients and update the parameters
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        
        # Print the loss every 10 batches
        if i % (10 * batch_size) == 0:
            print(f"Batch {i}/{len(input_ids)} - loss: {loss:.4f}")
            
# Save the fine-tuned model
model.save_pretrained("tf_gpt2_keras_lora")

# Generate text using the fine-tuned model
input_ids = tokenizer.encode("How much wood", return_tensors="tf")
output = model.generate(input_ids, max_length=100, do_sample=True, top_k=50, top_p=0.95, temperature=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
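
One caveat about the training loop above: pad_sequences fills with 0, which is also a real GPT-2 token id ("!"), and those padded positions are counted in the loss. Below is a minimal sketch of a masked loss, usable only under the assumption that 0 shows up in your batches solely as padding:

import tensorflow as tf

def masked_lm_loss(targets, logits, pad_id=0):
    """Cross-entropy averaged over non-padded positions only.

    pad_id=0 matches the default fill value of pad_sequences above; 0 is also
    a real GPT-2 token ("!"), so this is only safe if that id never occurs in
    the training text itself.
    """
    per_token = tf.keras.losses.sparse_categorical_crossentropy(
        targets, logits, from_logits=True
    )
    mask = tf.cast(tf.not_equal(targets, pad_id), per_token.dtype)
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)

Inside the GradientTape, loss = masked_lm_loss(targets, logits) would then replace the loss_fn(targets, logits) call.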
PlugTrade

I would recommend taking a look at this example provided by HuggingFace, which shows how to fine-tune a TensorFlow model for causal language modeling (i.e. text generation): https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling/run_clm.py

Regarding the specific question of how to represent the "labels": HuggingFace transformers models allow you to pass a labels parameter when executing your model. The value of this parameter should be the same as the tokenized input_ids, as described in the transformers docs (https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/gpt2#transformers.GPT2LMHeadModel.forward.labels):

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
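
As a concrete illustration, here is a minimal sketch that reuses the tokenizer and dataset names from the question (the fixed max_length of 128 and the EOS-as-pad choice are assumptions, not requirements):

# GPT-2 has no pad token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    tokens = tokenizer(
        examples["dataset"],   # the text column from the question
        truncation=True,
        max_length=128,
        padding="max_length",
    )
    # For causal LM the labels are a copy of input_ids (the model shifts them
    # internally); padded positions are set to -100 so the loss ignores them,
    # as described in the docs quoted above.
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_dataset = dataset.map(tokenize_function, batched=False)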

One other important thing to note in the run_clm.py script shared above: once you have prepared your tokenized dataset with input_ids and labels columns, you will need to convert it into a TensorFlow dataset object so that it can be used with model.fit(). This is done with the prepare_tf_dataset() function as shown here: https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling/run_clm.py#L505-L516
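
To sketch that last step (assuming the fixed-length tokenized_dataset produced above, so the default collator can simply stack the rows, and assuming the raw text column is named "dataset" as in the question):

import tensorflow as tf
from transformers import TFGPT2LMHeadModel

model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Drop the raw text column; only input_ids, attention_mask and labels are needed
tokenized_dataset = tokenized_dataset.remove_columns(["dataset"])

# Wrap the Hugging Face dataset as a tf.data.Dataset the model can consume
tf_train_dataset = model.prepare_tf_dataset(
    tokenized_dataset,
    batch_size=8,
    shuffle=True,
)

# Transformers TF models compute the language-modeling loss internally when a
# "labels" key is present in the batch, so compile() only needs an optimizer.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))
model.fit(tf_train_dataset, epochs=3)

model.save_pretrained("my-finetuned-gpt2")  # directory name is arbitrary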

George Novack