I am trying to fine-tune a Hugging Face model on a shellcode dataset (https://huggingface.co/datasets/SoLID/shellcode_i_a32).
The training code is a basic Hugging Face `Seq2SeqTrainer` setup, but we keep running into inf/nan issues during training.
from transformers import (
    PreTrainedTokenizerFast,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tkn1.json", padding_side="right")
# The custom tokenizer file has no padding token, so register one
special_tokens = {'pad_token': "[PAD]"}
tokenizer.add_special_tokens(special_tokens)

# `model` is our seq2seq model, loaded earlier in the notebook
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    save_total_limit=3,
    per_device_train_batch_size=128,
    num_train_epochs=5,
    warmup_ratio=0.06,
    learning_rate=1.0e-04,
    # fp16=True,
    debug=["underflow_overflow"],
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["test"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
# trainer.train()
# print(tokenizer.)
trainer.train()
# eval_loss = trainer.evaluate()
# print(f">>> Perplexity: {math.exp(eval_loss['eval_loss']):.2f}")
The output looks like this:
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Detected inf/nan during batch_number=0
Last 1 forward frames:
abs min abs max metadata
shared Embedding
5.42e-06 2.04e+04 weight
0.00e+00 1.46e+03 input[0]
1.56e-03 2.04e+04 output
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-120-ff4a54906908> in <module>
33 # trainer.train()
34 # print(tokenizer.)
---> 35 trainer.train()
36 # eval_loss = trainer.evaluate()
37 # print(f">>> Perplexity: {math.exp(eval_loss['eval_loss']):.2f}")
9 frames
/usr/local/lib/python3.8/dist-packages/transformers/debug_utils.py in forward_hook(self, module, input, output)
278
279 # now we can abort, as it's pointless to continue running
--> 280 raise ValueError(
281 "DebugUnderflowOverflow: inf/nan detected, aborting as there is no point running further. "
282 "Please scroll up above this traceback to see the activation values prior to this event."
ValueError: DebugUnderflowOverflow: inf/nan detected, aborting as there is no point running further. Please scroll up above this traceback to see the activation values prior to this event.
The very first layer (the shared embedding) seems to start throwing inf/nan values as soon as training begins, and the run never gets much further than that.
We have tried tweaking our training arguments but have hit a brick wall here. Any help is appreciated!
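One thing I am unsure about: the `[PAD]` token is added to the tokenizer above, but the model's embedding matrix is never resized. A quick sanity check for whether any batch contains token ids outside the embedding table would look something like this (the `check_batch_ids` helper name is just for illustration):

def check_batch_ids(trainer, model):
    # grab one collated batch from the training dataloader
    batch = next(iter(trainer.get_train_dataloader()))
    num_rows = model.get_input_embeddings().num_embeddings
    max_id = int(batch["input_ids"].max())
    print(f"largest token id in batch: {max_id}, embedding rows: {num_rows}")
    if max_id >= num_rows:
        print("some token ids fall outside the embedding table")

check_batch_ids(trainer, model)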