In the Dockerfile that builds my custom Docker base image, I specify the following base image:

FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04

The Dockerfile for this nvidia/cuda base image can be found here: https://gitlab.com/nvidia/container-images/cuda/blob/master/dist/ubuntu16.04/10.1/devel/cudnn7/Dockerfile

Now when I log the device from inside the AzureML run:

import torch
from azureml.core import Run

run = Run.get_context()
# setting device on GPU if available, else CPU
run.log("Using device: ", torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

I get

device(type='cpu')

but I would like to have a GPU and not a CPU. What am I doing wrong?

EDIT: I am not sure exactly what you need, but here is some more information. The azureml.core VERSION is 1.0.57. The compute_target is defined via:

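from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

def compute_target(ws, cluster_name):
    try:
        # reuse the cluster if it already exists in the workspace
        cluster = ComputeTarget(workspace=ws, name=cluster_name)
    except ComputeTargetException:
        # otherwise provision a new GPU cluster (NC6 is a GPU VM size)
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', min_nodes=0, max_nodes=4)
        cluster = ComputeTarget.create(ws, cluster_name, compute_config)
    return cluster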

The experiment is run via:

    ws = workspace(os.path.join("azure_cloud", 'config.json'))
    exp = experiment(ws, name=<name>)
    c_target = compute_target(ws, <name>)
    est = Estimator(source_directory='.',
                   script_params=script_params,
                   compute_target=c_target,
                   entry_script='azure_cloud/azure_training_wrapper.py',
                   custom_docker_image=image_name,
                   image_registry_details=img_reg_details,
                   user_managed = True,
                   environment_variables = {"SYSTEM": "azure_cloud"})

    # run the experiment / train the model
    run = exp.submit(config=est)

The yaml file contains:

dependencies:
  - conda-package-handling=1.3.10
  - python=3.6.2
  - cython=0.29.10
  - scikit-learn==0.21.2
  - anaconda::cloudpickle==1.2.1
  - anaconda::cffi==1.12.3
  - anaconda::mxnet=1.5.0
  - anaconda::psutil==5.6.3
  - anaconda::pycosat==0.6.3
  - anaconda::pip==19.1.1
  - anaconda::six==1.12.0
  - anaconda::mkl==2019.4
  - anaconda::cudatoolkit==10.1.168
  - conda-forge::pycparser==2.19
  - conda-forge::openmpi=3.1.2
  - pytorch::pytorch==1.2.0
  - tensorboard==1.13.1
  - tensorflow==1.13.1
  - tensorflow-estimator==1.13.0
  - pip:
      - pytorch-transformers==1.2.0
      - azure-cli==2.0.72
      - azure-storage-nspkg==3.1.0
      - azureml-sdk==1.0.57
      - pandas==0.24.2
      - tqdm==4.32.1
      - numpy==1.16.4
      - matplotlib==3.1.0
      - requests==2.22.0
      - setuptools==41.0.1
      - ipython==7.8.0
      - boto3==1.9.220
      - botocore==1.12.220
      - cntk==2.7
      - ftfy==5.6
      - gensim==3.8.0
      - horovod==0.16.4
      - keras==2.2.5
      - langdetect==1.0.7
      - langid==1.1.6
      - nltk==3.4.5
      - ptvsd==4.3.2
      - pytest==5.1.2
      - regex==2019.08.19
      - scipy==1.3.1
      - scikit_learn==0.21.3
      - spacy==2.1.8
      - tensorpack==0.9.8
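
As a side note on the dependency list above: scikit-learn is pinned twice, as 0.21.2 in the conda section and as scikit_learn==0.21.3 in the pip section. The following is a minimal sketch of how such double pins could be spotted, assuming the conda and pip entries have already been read into two lists of strings. The helper names (`normalize`, `parse_spec`, `find_conflicts`) are hypothetical and not part of conda or AzureML, and the sketch makes no claim that this double pin is related to the GPU problem.

```python
import re
from collections import defaultdict

def normalize(name):
    # Treat '-', '_' and '.' as equivalent and ignore case,
    # so 'scikit_learn' and 'scikit-learn' compare equal.
    return re.sub(r"[-_.]+", "-", name).lower()

def parse_spec(spec):
    """Split 'channel::name==version' into (normalized_name, version)."""
    spec = spec.split("::")[-1]            # drop an optional conda channel prefix
    name, _, version = spec.partition("=")  # works for both '=' and '=='
    return normalize(name.strip()), version.lstrip("=").strip()

def find_conflicts(conda_specs, pip_specs):
    """Return packages that are pinned to more than one distinct version."""
    seen = defaultdict(set)
    for source, specs in (("conda", conda_specs), ("pip", pip_specs)):
        for spec in specs:
            name, version = parse_spec(spec)
            seen[name].add((source, version))
    return {
        name: sorted(entries)
        for name, entries in seen.items()
        if len({version for _, version in entries}) > 1
    }

# A few entries copied from the YAML file above.
conda_specs = ["scikit-learn==0.21.2", "pytorch::pytorch==1.2.0", "anaconda::six==1.12.0"]
pip_specs = ["scikit_learn==0.21.3", "numpy==1.16.4"]

conflicts = find_conflicts(conda_specs, pip_specs)
print(conflicts)
```

For the excerpt used here, only scikit-learn is reported, once for each section it appears in.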

EDIT 2: I tried use_gpu = True as well as upgrading to azureml-sdk=1.0.65, but to no avail. Some people suggest additionally installing the CUDA drivers via apt-get install cuda-drivers, but this does not work: I cannot build a Docker image with that step. The output of nvcc --version on the Docker image is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

So I think that should be OK. The Docker image itself of course has no GPU, so the nvidia-smi command is not found, and

python -i

and then

import torch
print(torch.cuda.is_available())

will print False.
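
Since the image has no GPU when started on its own, a check like the one above says more when it runs inside the training script on the compute node. The following is a minimal sketch of such a check, not a definitive diagnosis. The function name `gpu_diagnostics` and the dictionary keys are hypothetical, and the sketch assumes that the NVIDIA container runtime, where present, exposes `nvidia-smi` and the `NVIDIA_VISIBLE_DEVICES` variable inside the container.

```python
import os
import shutil

def gpu_diagnostics():
    """Collect a few facts about GPU visibility in the current process."""
    info = {
        # Assumption: with the NVIDIA container runtime, nvidia-smi is
        # typically made available inside the container at run time.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; report that and stop.
        info["torch_installed"] = False
        return info
    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # torch.version.cuda is None when a CPU-only build of PyTorch is installed.
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()
    return info

report = gpu_diagnostics()
for key, value in report.items():
    print(f"{key}: {value}")
```

Inside an AzureML run, each entry could also be logged as a string, for example with run.log(key, str(value)). If `torch_cuda_build` is None, the installed PyTorch package is a CPU-only build. If it shows a CUDA version but `cuda_available` is False, the container probably cannot see the driver or the device.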

mgross
  • Please share the code of how you are triggering the Run with custom docker. In particular, what type of compute are you running on? What version of the SDK are you running (find out by running this on the command line `python -c "import azureml.core; print(azureml.core.VERSION)"`)? – Daniel Schneider Oct 01 '19 at 10:12
  • @Daniel Schneider: I edited the post above to add more information. Please tell me if there is anything more that you need. – mgross Oct 01 '19 at 12:27

1 Answer

In your Estimator definition, please try adding use_gpu=True:

est = Estimator(source_directory='.',
               script_params=script_params,
               compute_target=c_target,
               entry_script='azure_cloud/azure_training_wrapper.py',
               custom_docker_image=image_name,
               image_registry_details=img_reg_details,
               user_managed = True,
               environment_variables = {"SYSTEM": "azure_cloud"},
               use_gpu=True)

I believe that with azureml-sdk>=1.0.60 this should be inferred from the VM size used, but since you are using 1.0.57, I think it is still required.
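
To keep the flag consistent with the cluster on older SDK versions, it could be derived from the VM size string instead of being hard-coded. The following is a rough sketch under the assumption that GPU sizes are the Azure N-series (NC, ND, NV), so that the size name after the `STANDARD_` prefix starts with `N`. The helper `looks_like_gpu_vm` is hypothetical and is not how the SDK itself infers the setting.

```python
def looks_like_gpu_vm(vm_size):
    """Heuristic: Azure N-series sizes (NC, ND, NV, ...) are GPU VMs."""
    parts = vm_size.upper().split("_")
    # Expect e.g. ["STANDARD", "NC6"] or ["STANDARD", "NC6S", "V3"].
    return len(parts) >= 2 and parts[1].startswith("N")

vm_size = "STANDARD_NC6"  # the size used in the question
use_gpu = looks_like_gpu_vm(vm_size)
print(use_gpu)
```

The resulting value could then be passed as use_gpu=use_gpu in the Estimator definition above.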

Daniel Schneider
  • Neither using `use_gpu=True` nor upgrading the sdk to version `azureml-sdk=1.0.65` helped. See also EDIT 2 in my post. – mgross Oct 03 '19 at 11:14
  • Are you able to get a GPU when running this sample: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/ml-frameworks/pytorch/deployment/train-hyperparameter-tune-deploy-with-pytorch/train-hyperparameter-tune-deploy-with-pytorch.ipynb ? – Daniel Schneider Oct 05 '19 at 05:47