I'm running into runtime errors saying I don't have enough memory to fine-tune a pretrained LLM.
I'm a novelist, and I'm curious to see what would happen if I fine-tune a pretrained LLM to write more chapters of my novel in my style.
Yesterday I successfully ran a Hugging Face tutorial on fine-tuning a BERT model with a Yelp dataset that is smaller than mine, on my CPU (I have 16 GB of RAM and no NVIDIA GPU), so I'm not sure where the error is coming from now.
Some things I've tried that still give me a runtime memory error (a rough sketch of these reduced settings follows the list):
- changed my model from GPT-Neo to GPT-2, which is much smaller
- decreased my batch size hyperparameter
- decreased the max length of tokens
- decreased my dataset size
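For reference, here is roughly what the smaller GPT-2 configuration I tried looked like. The exact model name, batch size, and max length below are from memory, so treat this as a sketch rather than my actual script:

from transformers import GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments

# "gpt2" is ~124M parameters instead of GPT-Neo's 1.3B (values here are approximate)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", pad_token='[PAD]')
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

def tokenize_function(examples):
    # cap the sequence length instead of padding every example to the model maximum
    return tokenizer(examples["text"], padding='max_length', truncation=True, max_length=128)

training_args = TrainingArguments(
    output_dir=r"C:\Users\chris\OneDrive\Documents\ML\models",
    num_train_epochs=3,
    per_device_train_batch_size=2,   # much smaller than the 32 in the script below
    per_device_eval_batch_size=2,
)

Even with these reductions I still got a memory error.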
This is my code:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import Dataset, load_dataset
# Step 1: Import my novel
import docx
import pandas as pd
# Read each paragraph from a Word file
doc = docx.Document(r"C:\Users\chris\Downloads\The Black Squirrel (1).docx")
paras = [p.text for p in doc.paragraphs if p.text]
# Convert list to dataframe
df = pd.DataFrame(paras)
df.reset_index(drop=False,inplace=True)
df.rename(columns={'index':'label',0:'text'},inplace=True)
# Split my novel into train and test
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.05)
# Export novel as CSV to be read by Huggingface library
train.to_csv(r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_train.csv", index=False)
test.to_csv(r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_test.csv", index=False)
# Tokenize novel
datasets = load_dataset(
    'csv',
    data_files={
        'train': r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_train.csv",
        'test': r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_test.csv",
    },
)
# Instantiate tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", pad_token='[PAD]')
# Do I need the below?
# tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]'))
paragraphs = df['text']
# Longest paragraph in tokens (note: this value is never passed to the tokenizer below)
max_length = max(len(tokenizer.encode(p)) for p in paragraphs)
# Tokenize my novel
def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True)
# Step 2: Train the model
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
model.resize_token_embeddings(len(tokenizer))
training_args = TrainingArguments(
    output_dir=r"C:\Users\chris\OneDrive\Documents\ML\models",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    eval_steps=400,                  # number of update steps between two evaluations
    save_steps=800,                  # save the model after this many steps
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)
trainer.train()
Here is my error readout:
***** Running training *****
Num examples = 779
Num Epochs = 3
Instantaneous batch size per device = 32
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 75
Number of trainable parameters = 1315577856
0%| | 0/75 [19:12<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python37\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 9, in <module>
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 1547, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 1791, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 2539, in training_step
    loss = self.compute_loss(model, inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 2571, in compute_loss
    outputs = model(**inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 752, in forward
    return_dict=return_dict,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 627, in forward
    output_attentions=output_attentions,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 342, in forward
    feed_forward_hidden_states = self.mlp(hidden_states)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 300, in forward
    hidden_states = self.act(hidden_states)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\activations.py", line 35, in forward
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
RuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 405798912 bytes.
My novel is here. You can save it as a .docx as-is and run the code, or just save the first chapter. I also tried splitting the first chapter into one paragraph per sentence to make each example even smaller in tokens, though that didn't help (a rough sketch of the splitting is below).
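The sentence splitting was roughly like this; I don't have the exact code any more, so the regex below is a reconstruction of what I did:

import re

# split each paragraph from the .docx into sentences (naive split on ., !, ? followed by whitespace)
sentences = []
for para in paras:  # paras is the list of paragraphs read from the .docx above
    sentences.extend(s.strip() for s in re.split(r'(?<=[.!?])\s+', para) if s.strip())

# rebuild the dataframe with one sentence per row, same columns as before
df = pd.DataFrame({'text': sentences})
df.reset_index(drop=False, inplace=True)
df.rename(columns={'index': 'label'}, inplace=True)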
Does this indicate that I really need an NVIDIA GPU to run machine learning tasks? Or is this likely an issue with my dataset setup or code?
Thanks.