
I am trying to run a fairseq translation task on AML (Azure Machine Learning) using 4 GPUs (P100), and it fails with the following error:

```
-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 174, in all_gather_list
    result.append(pickle.loads(bytes(out_buffer[2 : size + 2].tolist())))
_pickle.UnpicklingError: invalid load key, '\xad'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 272, in distributed_main
    main(args, init_distributed=True)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 82, in main
    train(args, trainer, task, epoch_itr)
  File "/mnt/batch/tasks/shared/LS_root/jobs/nlx-ml-neuralrewrite/azureml/pytorch-fairseq_1568826205_6846ecb6/mounts/workspacefilestore/azureml/pytorch-fairseq_1568826205_6846ecb6/train.py", line 123, in train
    log_output = trainer.train_step(samples)
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/trainer.py", line 305, in train_step
    [logging_outputs, sample_sizes, ooms, self._prev_grad_norm],
  File "/azureml-envs/azureml_8ef3d311fd9072540e3352d9621cca49/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 178, in all_gather_list
    'Unable to unpickle data from other workers. all_gather_list requires all '
Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.
```

2019-09-18 17:28:44,727|azureml.WorkerPool|DEBUG|[STOP]

Error occurred: User program failed with Exception:

*(the same traceback as above is repeated here)*

The same code with the same parameters runs fine on a single local GPU. How do I resolve this issue?
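
For context, here is a minimal sketch (not fairseq's actual implementation) of what an `all_gather_list`-style collective does, assuming `torch.distributed` is already initialized with a backend that can gather CPU tensors (e.g. gloo): each worker pickles its object into a fixed-size byte buffer, the buffers are all-gathered, and every worker unpickles all of them. If one worker reaches this point late, or contributes a stale buffer, the other workers hit exactly the `_pickle.UnpicklingError: invalid load key` shown above.

```python
import pickle

import torch
import torch.distributed as dist


def all_gather_list_sketch(obj, max_size=16384):
    """Gather an arbitrary picklable object from every worker (sketch only)."""
    world_size = dist.get_world_size()

    data = pickle.dumps(obj)
    size = len(data)
    assert size + 2 <= max_size, "pickled object does not fit in the buffer"

    # The first two bytes encode the payload length; the rest is the pickle payload.
    buffer = torch.zeros(max_size, dtype=torch.uint8)
    buffer[0] = size // 255
    buffer[1] = size % 255
    buffer[2 : size + 2] = torch.tensor(list(data), dtype=torch.uint8)

    # Every rank must reach this call together; a rank that is still busy
    # (e.g. recovering from an OOM) leaves the others gathering garbage bytes.
    gathered = [torch.zeros(max_size, dtype=torch.uint8) for _ in range(world_size)]
    dist.all_gather(gathered, buffer)

    results = []
    for out in gathered:
        out_size = 255 * int(out[0]) + int(out[1])
        # This is the step that raises UnpicklingError when a buffer does not
        # actually start with a valid pickle stream.
        results.append(pickle.loads(bytes(out[2 : out_size + 2].tolist())))
    return results
```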

  • Can you show the exact call that is failing? – StefanG Sep 18 '19 at 19:28
  • In particular, which file is being unpickled and where is it coming from? – Daniel Schneider Sep 19 '19 at 08:54
  • It doesn't say which file was being unpickled. I am just creating an estimator as: `est = PyTorch(source_directory='./fairseq', script_params=script_params, compute_target=compute_target, entry_script='train.py', pip_packages=['fairseq', 'tensorboardX'], use_gpu=True)` and then submitting a run with it (see the sketch after these comments). It loads the data, creates the model, and then fails after a few minutes with the above error. – Juhi Naik Sep 19 '19 at 20:37
  • Weirdly, it seems to run fine in a different workspace with V100 GPUs – Juhi Naik Sep 19 '19 at 20:40
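
For reference, below is a hedged sketch of what such a submission might look like with the distributed fairseq options spelled out. The estimator arguments mirror the comment above; the fairseq flags (`--distributed-world-size`, `--ddp-backend`) and their values are illustrative assumptions, not the asker's actual command line, and `compute_target` / `experiment` are assumed to be defined earlier in the notebook.

```python
from azureml.train.dnn import PyTorch

# Illustrative fairseq arguments only; the architecture and remaining
# hyper-parameters are placeholders. fairseq's train.py also expects the
# preprocessed data directory as a positional argument, which is omitted here.
script_params = {
    '--arch': 'transformer',
    '--distributed-world-size': 4,   # one process per P100 on the node
    '--ddp-backend': 'no_c10d',      # fairseq flag sometimes used when the default
                                     # c10d DDP path causes synchronization issues
}

est = PyTorch(
    source_directory='./fairseq',
    script_params=script_params,
    compute_target=compute_target,        # assumed: the 4 x P100 AML compute target
    entry_script='train.py',
    pip_packages=['fairseq', 'tensorboardX'],
    use_gpu=True,
)

run = experiment.submit(est)              # `experiment` assumed to exist already
```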

0 Answers