
I get a disk full error while running a model training job using the Azure ML SDK, launched from Azure DevOps. I created a custom environment inside the Azure ML workspace and am using it for the job.

I am using Azure CLI tasks in Azure DevOps to launch these training jobs. How can I resolve the disk full issue?

Error message shown in the DevOps training task:

"error": {
        "code": "UserError",
        "message": "{\"Compliant\":\"Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 14045 MB, available space: 1103 MB.\"}\n{\n  \"code\": \"DiskFullError\",\n  \"target\": \"\",\n  \"category\": \"UserError\",\n  \"error_details\": []\n}",
        "messageParameters": {},
        "details": []
    },

The .runconfig file for the training job:

 framework: Python
 script: cnn_training.py
 communicator: None
 autoPrepareEnvironment: true
 maxRunDurationSeconds:
 nodeCount: 1
 environment:
   name: cnn_training
   python:
     userManagedDependencies: true
     interpreterPath: python
   docker:
     enabled: true
     baseImage: 54646eeace594cf19143dad3c7f31661.azurecr.io/azureml/azureml_b17300b63a1c2abb86b2e774835153ee
     sharedVolumes: true
     gpuSupport: false
     shmSize: 2g
     arguments: []
 history:
   outputCollection: true
   snapshotProject: true
   directoriesToWatch:
   - logs
 dataReferences:
   workspaceblobstore:
     dataStoreName: workspaceblobstore
     pathOnDataStore: dataname
     mode: download
     overwrite: true
     pathOnCompute:

Is there additional configuration to be done for the disk full issue? Are there any changes to be made in the .runconfig file?

Imperial_J

2 Answers


According to the error message below, we suppose that your issue results from a lack of disk space on your compute cluster or VM SKU.

Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU. Total space: 14045 MB, available space: 1103 MB.

I suggest that you consider the three steps below, and then test again.

1. Clear the storage cache.

2. Upgrade your cluster storage size, for example by moving to a VM SKU with a larger disk (see the sketch after this list).

3. Optimize your machine learning resource size.
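
For step 2, if you are running on a compute cluster, one option is to re-create the cluster with a larger VM SKU so that each node has a bigger OS/temp disk. A minimal sketch with the v1 Python SDK; the cluster name and SKU below are only examples, not a recommendation for your workload:

from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Example cluster name and SKU; replace with your own. Pick a SKU whose
# OS/temp disk is large enough for your environment and downloaded data.
cluster_name = "cnn-training-cluster"
compute_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS13_V2",
    min_nodes=0,
    max_nodes=1,
)

compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)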

=========================

Updated 11/10

Hi L_Jay, you could refer to Azure Machine Learning to upgrade your subscription to a better-performing instance.

Ceeno Qi-MSFT
  • I am currently using Azure ML Studio compute to run the training job. I tried using a better compute instance: I was initially using 2 cores, which I then upgraded to 4, and still the error persists. The training jobs are launched from the Azure CLI, which is run from Azure DevOps using the default agent provided by Azure. How can I clear the storage cache in this case? This is the default agent I am using: https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md – Imperial_J Nov 09 '22 at 03:43
  • @L_Jay have you tried upgrading the disk storage of your instance? Could you also share the properties of your instance with us? – Ceeno Qi-MSFT Nov 10 '22 at 06:29
  • I am using a compute instance offered by Azure ML Studio. The size is "STANDARD_D2_V3". – Imperial_J Nov 10 '22 at 08:40
  • @L_Jay, you could check the update in my post – Ceeno Qi-MSFT Nov 10 '22 at 09:03
  • I actually upgraded the compute to a "D13 v2" and still the error persists – Imperial_J Nov 11 '22 at 11:34
  • @L_Jay to narrow down your issue, have you ever tried to run the script directly from your local machine? – Ceeno Qi-MSFT Dec 01 '22 at 09:19
  • The script works fine in Azure ML notebooks – Imperial_J Dec 02 '22 at 02:53

I have a suspicion your disk-full error is due to memory leaking into swap. Double-check that you are not creating extraneous objects in your code, and that you are not loading too much training data without clearing it out.

I have made this mistake on a local machine: front-loading all of my data in my ML script and maxing out my memory, as opposed to loading data piecewise and deleting it after each training iteration (a rough sketch of the piecewise pattern is below).
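
Roughly, the piecewise pattern I mean looks like this; the file paths and train_step are placeholders for your own data and training code:

import gc
import numpy as np

# Illustrative chunked loading: process one piece of the training data at a
# time and free it before loading the next, instead of front-loading everything.
chunk_files = ["data/part_000.npy", "data/part_001.npy"]  # placeholder paths

def train_step(x):
    # Placeholder for your own training logic, e.g. model.train_on_batch(...).
    pass

for epoch in range(5):
    for path in chunk_files:
        chunk = np.load(path)   # load only this chunk into memory
        train_step(chunk)
        del chunk               # drop the reference...
        gc.collect()            # ...so the memory can be reclaimed before the next load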

Also, this is a guess, but have you tried modifying your shmSize: 2g parameter? See https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources (a sketch of setting it through the SDK is below).
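
If you end up submitting through the v1 Python SDK instead of the .runconfig file, the same setting can be passed via DockerConfiguration; a rough sketch, where the compute target name is a placeholder:

from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.core.runconfig import DockerConfiguration

ws = Workspace.from_config()

# Reuse the custom environment already registered in the workspace.
env = Environment.get(workspace=ws, name="cnn_training")

# Increase the Docker shared-memory size (the SDK counterpart of shmSize).
docker_config = DockerConfiguration(use_docker=True, shm_size="8g")

src = ScriptRunConfig(
    source_directory=".",
    script="cnn_training.py",
    compute_target="your-compute-name",   # placeholder compute target
    environment=env,
    docker_runtime_config=docker_config,
)

run = Experiment(ws, "cnn-training").submit(src)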

i0x539
  • I tried running with 15g and still the error persisted. – Imperial_J Nov 21 '22 at 02:46
  • The disk full issue comes up when I try to use TensorFlow in my scripts. Other scripts which run on sklearn are fine, so I believe it is not an issue related to the data size; I just can't figure out how to get extra space to install and run TensorFlow. – Imperial_J Nov 21 '22 at 02:48