
Describe the bug I started running a DeepSpeed config on the EleutherAI/pythia-2.8b model and ran into an error: the run exits with return code = -9. After splitting and preprocessing the dataset I get [ERROR] [launch.py:434:sigkill_handler].

Log output I used train_dolly.py to launch the application. It should save files in the given locations.

DeepSpeed config

{
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
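
For reference, a minimal sketch of how a config like this is normally handed to a Hugging Face Trainer (this is not the actual training.trainer code; the deepspeed argument of TrainingArguments is the standard transformers mechanism, and the "auto" fields in the JSON are resolved from these arguments at init time):

    # Minimal sketch (not the actual training.trainer code): wiring the DeepSpeed JSON
    # into a Hugging Face Trainer. The "auto" fields in the config (lr, batch sizes,
    # gradient accumulation, etc.) are filled in from TrainingArguments when the
    # Trainer initializes DeepSpeed.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="/local_disk0/dolly_training/example",  # hypothetical output path
        num_train_epochs=2,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        learning_rate=5e-6,
        warmup_steps=50,
        deepspeed="/Workspace/Repos/dinesh/dolly/config/ds_z3_bf16_config.json",
    )
    # trainer = Trainer(model=model, args=training_args, ...)  # model/datasets omitted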

DeepSpeed launch command

!deepspeed {num_gpus_flag} \
    --module training.trainer \
    --input-model {input_model} \
    --deepspeed {deepspeed_config} \
    --epochs 2 \
    --local-output-dir {local_output_dir} \
    --dbfs-output-dir {dbfs_output_dir} \
    --per-device-train-batch-size 1 \
    --per-device-eval-batch-size 1 \
    --logging-steps 10 \
    --save-steps 200 \
    --save-total-limit 20 \
    --eval-steps 50 \
    --warmup-steps 50 \
    --test-size 200 \
    --lr 5e-6
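
The curly-brace placeholders above are notebook variables. For completeness, an illustrative reconstruction of how they might be defined in the Databricks notebook before the !deepspeed cell (names and values are inferred from the paths in the log below; the actual train_dolly.py notebook may define them differently):

    # Illustrative only: reconstructed notebook variables feeding the !deepspeed command.
    from datetime import datetime

    timestamp = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    checkpoint_dir_name = f"dolly__01__{timestamp}"

    num_gpus_flag = "--num_gpus=1"  # single-GPU node
    input_model = "EleutherAI/pythia-2.8b"
    deepspeed_config = "/Workspace/Repos/dinesh/dolly/config/ds_z3_bf16_config.json"
    local_output_dir = f"/local_disk0/dolly_training/{checkpoint_dir_name}"
    dbfs_output_dir = f"/dbfs/FileStore/tables/dolly_training/{checkpoint_dir_name}"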

Output when running the DeepSpeed launch command

[2023-05-05 07:58:29,770] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-05 07:58:29,778] [INFO] [runner.py:541:main] cmd = /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --module --enable_each_rank_log=None training.trainer --input-model EleutherAI/pythia-2.8b --deepspeed /Workspace/Repos/dinesh/dolly/config/ds_z3_bf16_config.json --epochs 2 --local-output-dir /local_disk0/dolly_training/dolly__01__2023-05-05T07:58:09 --dbfs-output-dir /dbfs/FileStore/tables/dolly_training/dolly__01__2023-05-05T07:58:09 --per-device-train-batch-size 1 --per-device-eval-batch-size 1 --logging-steps 10 --save-steps 200 --save-total-limit 20 --eval-steps 50 --warmup-steps 50 --test-size 200 --lr 5e-6
[2023-05-05 07:58:33,188] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-05 07:58:33,189] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-05 07:58:33,189] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-05 07:58:33,189] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-05 07:58:33,189] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
2023-05-05 07:58:43 INFO [__main__] Loading tokenizer for EleutherAI/pythia-2.8b
2023-05-05 07:58:44 INFO [__main__] Loading model for EleutherAI/pythia-2.8b
Downloading (…)lve/main/config.json: 100%|██████| 571/571 [00:00<00:00, 195kB/s]
Downloading pytorch_model.bin: 100%|████████| 5.68G/5.68G [00:12<00:00, 461MB/s]
2023-05-05 07:59:20 INFO [__main__] Found max lenth: 2048
2023-05-05 07:59:20 INFO [__main__] Loading dataset from Dinesh007a/classification_review_data
2023-05-05 07:59:22 WARNING [datasets.builder] Found cached dataset json (/root/.cache/huggingface/datasets/Dinesh007a___json/Dinesh007a--classification_review_data-943b17268cd4d80e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 920.21it/s]
2023-05-05 07:59:22 INFO [__main__] Found 6096 rows
2023-05-05 07:59:22 INFO [__main__] Preprocessing dataset                       
2023-05-05 07:59:24 INFO [__main__] Processed dataset has 6096 rows             
2023-05-05 07:59:24 INFO [__main__] Processed dataset has 6096 rows after filtering for truncated records
2023-05-05 07:59:24 INFO [__main__] Shuffling dataset
2023-05-05 07:59:24 INFO [__main__] Done preprocessing
2023-05-05 07:59:24 INFO [__main__] Train data size: 5896
2023-05-05 07:59:24 INFO [__main__] Test data size: 200
[2023-05-05 07:59:24,626] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2023-05-05 07:59:24 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 0
2023-05-05 07:59:24 INFO [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2023-05-05 07:59:24 INFO [__main__] Instantiating Trainer
2023-05-05 07:59:24 INFO [__main__] Training
2023-05-05 07:59:27 INFO [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 0
2023-05-05 07:59:27 INFO [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 29.33124089241028 seconds
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu117/utils...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/include/THC -isystem /usr/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 15.785842180252075 seconds
Rank: 0 partition count [1] and sizes[(2775086080, False)] 
[2023-05-05 08:01:08,321] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2427
[2023-05-05 08:01:08,324] [ERROR] [launch.py:434:sigkill_handler] ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-9708c979-1e4b-49c6-aafc-8ef52637190f/bin/python', '-u', '-m', 'training.trainer', '--local_rank=0', '--input-model', 'EleutherAI/pythia-2.8b', '--deepspeed', '/Workspace/Repos/dinesh/dolly/config/ds_z3_bf16_config.json', '--epochs', '2', '--local-output-dir', '/local_disk0/dolly_training/dolly__01__2023-05-05T07:58:09', '--dbfs-output-dir', '/dbfs/FileStore/tables/dolly_training/dolly__01__2023-05-05T07:58:09', '--per-device-train-batch-size', '1', '--per-device-eval-batch-size', '1', '--logging-steps', '10', '--save-steps', '200', '--save-total-limit', '20', '--eval-steps', '50', '--warmup-steps', '50', '--test-size', '200', '--lr', '5e-6'] exits with return code = -9

Additional context Please let me know what the issue is and what the [INFO] messages in the log output above indicate.

I am trying to run the train_dolly.py application in a Databricks repo; inside it, the DeepSpeed config throws an error and exits with return code -9.

I am expecting the files to be stored in the locations I mentioned above.

System info (please complete the following information):

I tried this in Databricks with Databricks Runtime 11.3. Node type: single node, 56 GB memory, 1 GPU. Python version 3.9.

Dinesh
  • Please update your post to paste the text instead of the screenshots... Also add the node type information – Alex Ott May 07 '23 at 09:53

0 Answers