squad2.0 training error: THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=100 : no CUDA-capable device is detected

Question

!python -m torch.distributed.launch --nproc_per_node=8 /root/examples/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
    --do_lower_case \
    --train_file /root/DATA/train-v2.0.json \
    --predict_file /root/DATA/dev-v2.0.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ../root/result/ \
    --per_gpu_eval_batch_size=3 \
    --per_gpu_train_batch_size=3

I'm using Google Colab and I want to train on the Q&A dataset that I downloaded from the SQuAD website. But when I run the command above, it returns an error.

Can someone help me fix this problem? The full error message follows, and I'd appreciate any suggestions:

This is the error message. Each of the 8 worker processes prints the same traceback, so one copy is shown, followed by the launcher's traceback:

    THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=100 : no CUDA-capable device is detected
    Traceback (most recent call last):
      File "/root/examples/run_squad.py", line 575, in <module>
        main()
      File "/root/examples/run_squad.py", line 469, in main
        torch.cuda.set_device(args.local_rank)
      File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 300, in set_device
        torch._C._cuda_setDevice(device)
      File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 193, in _lazy_init
        torch._C._cuda_init()
    RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:50

    Traceback (most recent call last):
      File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 253, in <module>
        main()
      File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 249, in main
        cmd=cmd)
    subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', '/root/examples/run_squad.py', '--local_rank=7', '--model_type', 'bert', '--model_name_or_path', 'bert-large-uncased-whole-word-masking', '--do_train', '--do_eval', '--do_lower_case', '--train_file', '/root/DATA/train-v2.0.json', '--predict_file', '/root/DATA/dev-v2.0.json', '--learning_rate', '3e-5', '--num_train_epochs', '2', '--max_seq_length', '384', '--doc_stride', '128', '--output_dir', '../root/result/', '--per_gpu_eval_batch_size=3', '--per_gpu_train_batch_size=3']' returned non-zero exit status 1.

Did you change your `runtime type` to GPU (Runtime -> Change runtime type -> Hardware accelerator)? - cronoik, Dec 12 '19 at 23:03
I am getting the same error while trying to train on a different repo. Were you able to solve this? - dilit, Jan 12 '20 at 15:13
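
Following up on the runtime-type comment: a minimal sketch for checking how many CUDA devices PyTorch can actually see before launching, and for capping `--nproc_per_node` accordingly. The helper names `visible_gpu_count` and `pick_nproc` are hypothetical (they are not part of `run_squad.py` or PyTorch); only `torch.cuda.is_available()` and `torch.cuda.device_count()` are real PyTorch calls. This assumes the error comes from the runtime exposing no GPU, as the message suggests. A Colab GPU runtime also typically exposes a single GPU, so 8 processes would likely be too many even after switching.

```python
def visible_gpu_count():
    """Return how many CUDA devices PyTorch can see (0 if none, or if torch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def pick_nproc(requested, available):
    """Hypothetical helper: cap the requested worker count at the number of visible GPUs."""
    if available < 1:
        raise RuntimeError("No CUDA device visible; switch the runtime to GPU first.")
    return min(requested, available)


if __name__ == "__main__":
    gpus = visible_gpu_count()
    print("visible GPUs:", gpus)
    if gpus:
        # 8 is the value passed to --nproc_per_node in the question's command.
        print("suggested --nproc_per_node:", pick_nproc(8, gpus))
```

If this reports 0 visible GPUs, `torch.cuda.set_device(args.local_rank)` in `run_squad.py` can be expected to fail the way the traceback shows, whatever `--nproc_per_node` is set to.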

