I'm running into runtime errors saying I don't have enough memory to fine-tune a pretrained LLM.
I'm a novelist, and I'm curious to see what would happen if I fine-tune a pretrained LLM to write more chapters of my novel in my style.
Yesterday I successfully ran a Hugging Face tutorial on fine-tuning a BERT model with a Yelp dataset that is smaller than mine, on my CPU (I have 16 GB of RAM and no NVIDIA GPU), so I'm not sure where the error is coming from now.
Some things I've tried that still give me a runtime memory error (a rough sketch of these reduced settings follows the list):
- changed my model from GPT-Neo to GPT-2, which is much smaller
- decreased my batch size hyperparameter
- decreased the max length of tokens
- decreased my dataset size
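For reference, here is roughly what the smaller GPT-2 configuration I tried looked like. The exact model name, batch size, and max length below are from memory, so treat this as a sketch rather than my actual script:

from transformers import GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments

# "gpt2" is ~124M parameters instead of GPT-Neo's 1.3B (values here are approximate)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", pad_token='[PAD]')
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

def tokenize_function(examples):
    # cap the sequence length instead of padding every example to the model maximum
    return tokenizer(examples["text"], padding='max_length', truncation=True, max_length=128)

training_args = TrainingArguments(
    output_dir=r"C:\Users\chris\OneDrive\Documents\ML\models",
    num_train_epochs=3,
    per_device_train_batch_size=2,   # much smaller than the 32 in the script below
    per_device_eval_batch_size=2,
)

Even with these reductions I still got a memory error.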
This is my code:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import Dataset, load_dataset
# Step 1: Import my novel
import docx
import pandas as pd
# Read each paragraph from a Word file
doc = docx.Document(r"C:\Users\chris\Downloads\The Black Squirrel (1).docx")
paras = [p.text for p in doc.paragraphs if p.text]
# Convert list to dataframe
df = pd.DataFrame(paras)
df.reset_index(drop=False,inplace=True)
df.rename(columns={'index':'label',0:'text'},inplace=True)
# Split my novel into train and test
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.05)
# Export novel as CSV to be read by Huggingface library
train.to_csv(r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_train.csv", index=False)
test.to_csv(r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_test.csv", index=False)
# Tokenize novel
datasets = load_dataset(
    'csv',
    data_files={
        'train': r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_train.csv",
        'test': r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_test.csv",
    },
)
# Instantiate tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", pad_token='[PAD]')
# Do I need the below?
# tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]'))
paragraphs = df['text']
# Longest paragraph in tokens (note: this value is never passed to the tokenizer below)
max_length = max(len(tokenizer.encode(p)) for p in paragraphs)
# Tokenize my novel
def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True)
# Step 2: Train the model
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
model.resize_token_embeddings(len(tokenizer))
training_args = TrainingArguments(
    output_dir=r"C:\Users\chris\OneDrive\Documents\ML\models",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    eval_steps=400,                  # number of update steps between two evaluations
    save_steps=800,                  # save the model after this many steps
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)
trainer.train()
Here is my error readout:
***** Running training *****
Num examples = 779
Num Epochs = 3
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 75
Number of trainable parameters = 1315577856
0%| | 0/75 [19:12<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python37\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 9, in <module>
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 1547, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 1791, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 2539, in training_step
    loss = self.compute_loss(model, inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 2571, in compute_loss
    outputs = model(**inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 752, in forward
    return_dict=return_dict,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 627, in forward
    output_attentions=output_attentions,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 342, in forward
    feed_forward_hidden_states = self.mlp(hidden_states)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 300, in forward
    hidden_states = self.act(hidden_states)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\activations.py", line 35, in forward
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
RuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 405798912 bytes.
My novel is here. You can save it as a .docx as-is and run the code, or just save the first chapter. I also tried splitting the first chapter into one paragraph per sentence to make each example even smaller in tokens, though that didn't help (a rough sketch of the splitting is below).
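The sentence splitting was roughly like this; I don't have the exact code any more, so the regex below is a reconstruction of what I did:

import re

# split each paragraph from the .docx into sentences (naive split on ., !, ? followed by whitespace)
sentences = []
for para in paras:  # paras is the list of paragraphs read from the .docx above
    sentences.extend(s.strip() for s in re.split(r'(?<=[.!?])\s+', para) if s.strip())

# rebuild the dataframe with one sentence per row, same columns as before
df = pd.DataFrame({'text': sentences})
df.reset_index(drop=False, inplace=True)
df.rename(columns={'index': 'label'}, inplace=True)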
Does this indicate that I really need an NVIDIA GPU to run machine learning tasks? Or is this likely an issue with my dataset setup or code?
Thanks.