Huggingface Trainer keeps giving Segmentation Fault
The Trainer segfaults as soon as training starts with the setup code below.
The dataset is around 600 MB, and the server has two 32 GB Nvidia V100 GPUs. Can anyone help find the issue?
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, LineByLineTextDataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast
config = GPT2Config.from_pretrained('gpt2-large')
model = GPT2LMHeadModel(config=config)
tokenizer = GPT2TokenizerFast.from_pretrained("./data/TOKEN", model_max_length=1024)
print('loading dataset...')
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/kowiki.txt",
    block_size=128,
)
training_args = TrainingArguments(
    output_dir='./m',                # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=1,   # batch size per device during training; higher is better, but may OOM
    per_device_eval_batch_size=1,    # batch size for evaluation
    logging_dir='./logs',            # directory for storing logs
    save_steps=10000,
    do_train=True,
)
trainer = Trainer(
    model=model,            # the instantiated Transformers model to be trained
    args=training_args,     # training arguments, defined above
    train_dataset=dataset,  # training dataset
)
trainer.train()
Error message:
loading dataset...
Epoch: 0%| | 0/1 [00:00<?, ?it/s]
Fatal Python error: Segmentation fault | 0/99996 [00:00<?, ?it/s]
Thread 0x00007f872dfff700 (most recent call first):
File "/opt/conda/lib/python3.6/threading.py", line 299 in wait
File "/opt/conda/lib/python3.6/threading.py", line 551 in wait
File "/opt/conda/lib/python3.6/site-packages/tqdm/_monitor.py", line 69 in run
File "/opt/conda/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/opt/conda/lib/python3.6/threading.py", line 884 in _bootstrap
Thread 0x00007f8736bb5700 (most recent call first):
File "/opt/conda/lib/python3.6/threading.py", line 299 in wait
File "/opt/conda/lib/python3.6/queue.py", line 173 in get
File "/opt/conda/lib/python3.6/site-packages/tensorboard/summary/writer/event_file_writer.py", line 205 in run
File "/opt/conda/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/opt/conda/lib/python3.6/threading.py", line 884 in _bootstrap
Current thread 0x00007f88273e7740 (most recent call first):
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 39 in broadcast_coalesced
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 21 in forward
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 71 in _broadcast_coalesced_reshape
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 88 in replicate
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159 in replicate
File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 154 in forward
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577 in __call__
File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 622 in _training_step
File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 499 in train
File "trainer.py", line 34 in <module>
Segmentation fault (core dumped)
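The current thread in the trace above dies inside torch's DataParallel replication (broadcast_coalesced). As far as I can tell, the Trainer only takes that path when more than one GPU is visible. A minimal sketch for ruling that path in or out, assuming a single-V100 run is acceptable as a test: hide the second GPU before CUDA is initialized. The helper name restrict_to_single_gpu is hypothetical; CUDA_VISIBLE_DEVICES is the standard CUDA environment variable.

```python
import os


def restrict_to_single_gpu(device_index: int = 0) -> str:
    """Expose only one GPU to CUDA (hypothetical helper, not part of transformers).

    Must run before torch initializes CUDA, so call it at the very top of
    the script, ahead of `import torch` / `import transformers`.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_to_single_gpu(0)
print("CUDA_VISIBLE_DEVICES =", visible)
# Only after this point: import torch/transformers and build the Trainer.
```

If training starts normally with one visible GPU, that would point at the multi-GPU broadcast rather than at the dataset or the tokenizer.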
Python version is 3.7.7, with PyTorch 1.5.1+cu101 and recently installed Hugging Face transformers and tokenizers (0.8.0).
First time here :D Sorry if I left out anything obvious.
EDIT: This may be a bug: https://github.com/huggingface/transformers/issues/5590
EDIT 2: It also segfaults on transformers 3.0.2 and on tokenizers 0.7.0 and 0.8.1-rc1.
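Since the crash persists across transformers and tokenizers versions, it may be worth checking whether plain PyTorch replication across the two GPUs crashes without the Trainer involved. This is only a sketch under that assumption: check_replicate is a hypothetical name, the tiny Linear layer stands in for the real model, and the function skips itself when torch or a second GPU is missing.

```python
def check_replicate() -> str:
    """Replicate a tiny module across all visible GPUs, as DataParallel does."""
    try:
        import torch
    except ImportError:
        return "skipped: torch not installed"

    n_gpus = torch.cuda.device_count()
    if n_gpus < 2:
        return "skipped: fewer than 2 GPUs visible"

    devices = list(range(n_gpus))
    # The source module has to live on the first device before replication.
    module = torch.nn.Linear(8, 8).to("cuda:0")
    # Same call that DataParallel.forward makes internally; it broadcasts
    # the parameters to every device in `devices`.
    replicas = torch.nn.parallel.replicate(module, devices)
    return "ok: %d replicas" % len(replicas)


if __name__ == "__main__":
    print(check_replicate())
```

If this small script also segfaults, the problem is likely in the PyTorch/CUDA/driver setup rather than in transformers.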