
The Hugging Face Trainer keeps giving a segmentation fault with the setup code below. The dataset is around 600 MB, and the server has 2× 32 GB NVIDIA V100 GPUs. Can anyone help me find the issue?

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, LineByLineTextDataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("./data/TOKEN", model_max_length=1024)

config = GPT2Config.from_pretrained('gpt2-large')
model = GPT2LMHeadModel(config=config)  # fresh model from the gpt2-large config, no pretrained weights

print('loading dataset...')
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/kowiki.txt",
    block_size=128,
)

training_args = TrainingArguments(
    output_dir='./m',          # output directory
    num_train_epochs=1,              # total # of training epochs
    per_device_train_batch_size=1,  # batch size per device during training - the higher the better, but may OOM
    per_device_eval_batch_size=1,   # batch size for evaluation
    logging_dir='./logs',            # directory for storing logs
    save_steps=10000,
    do_train=True
)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset,         # training dataset
)

trainer.train()
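
(A note on the code above: DataCollatorForLanguageModeling is imported but never passed to the Trainer. For reference, this is roughly how I would wire it in, with mlm=False since GPT-2 is a causal LM; the segfault below happens with the code exactly as posted, i.e. without any collator.)

from transformers import DataCollatorForLanguageModeling

# Not part of the run that crashes; shown only so the unused import makes sense.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,   # batches input_ids and builds labels for causal LM
    train_dataset=dataset,
)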

Error message:

loading dataset...
Epoch:   0%|                                              | 0/1 [00:00<?, ?it/s]
Fatal Python error: Segmentation fault                | 0/99996 [00:00<?, ?it/s]

Thread 0x00007f872dfff700 (most recent call first):
  File "/opt/conda/lib/python3.6/threading.py", line 299 in wait
  File "/opt/conda/lib/python3.6/threading.py", line 551 in wait
  File "/opt/conda/lib/python3.6/site-packages/tqdm/_monitor.py", line 69 in run
  File "/opt/conda/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/opt/conda/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f8736bb5700 (most recent call first):
  File "/opt/conda/lib/python3.6/threading.py", line 299 in wait
  File "/opt/conda/lib/python3.6/queue.py", line 173 in get
  File "/opt/conda/lib/python3.6/site-packages/tensorboard/summary/writer/event_file_writer.py", line 205 in run
  File "/opt/conda/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/opt/conda/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f88273e7740 (most recent call first):
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 39 in broadcast_coalesced
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 21 in forward
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 71 in _broadcast_coalesced_reshape
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 88 in replicate
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159 in replicate
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 154 in forward
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577 in __call__
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 622 in _training_step
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 499 in train
  File "trainer.py", line 34 in <module>
Segmentation fault (core dumped)

Python version is 3.7.7, with PyTorch 1.5.1+cu101 and a recently installed Hugging Face transformers and tokenizers (0.8.0).
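
The current thread dies inside torch.cuda.comm.broadcast_coalesced, which is the first step of DataParallel's replicate across the two V100s. A minimal standalone check of that call, independent of transformers (my sketch, assuming both GPUs are visible as cuda:0 and cuda:1):

import torch
from torch.cuda import comm

# Broadcast a single tensor from GPU 0 to both GPUs, mirroring the call at the
# top of the crashing stack (torch/cuda/comm.py, broadcast_coalesced).
t = torch.randn(1024, 1024, device="cuda:0")
copies = comm.broadcast_coalesced([t], devices=[0, 1])
print([c.device for per_device in copies for c in per_device])

If this small script also segfaults, the problem would be in PyTorch/CUDA rather than in transformers.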

First time posting here :D sorry if I left out anything trivial.

EDIT: This may be a bug - https://github.com/huggingface/transformers/issues/5590

EDIT 2: Also segfaults with transformers 3.0.2 and tokenizers 0.7.0 and 0.8.1-rc1.
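
What I plan to try next, since the crash happens in DataParallel's replicate step: restrict the script to a single GPU so the Trainer never wraps the model in DataParallel. A sketch (the environment variable has to be set before CUDA is initialized, i.e. at the very top of the script):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # hide the second V100

import torch
print(torch.cuda.device_count())  # should now print 1, so Trainer skips DataParallel

If training runs with only one GPU visible, that would narrow it down to the multi-GPU replication path shown in the trace above.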
