
I'm using the Vertex AI custom training feature on Google Cloud Platform (GCP) to train a model, but every time I trigger training, it takes about 10 minutes of provisioning time before the job actually starts training.

Is there any way to reduce the provisioning time of Vertex AI's custom training jobs? Thanks :)
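
For reference, this is roughly how I submit the job (a minimal sketch using the Vertex AI Python SDK; the project, bucket, and image names below are placeholders rather than my actual values):

```python
# Minimal sketch: submit a custom-container training job with the
# Vertex AI Python SDK (google-cloud-aiplatform). All names are placeholders.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                      # placeholder project ID
    location="us-central1",                    # Vertex AI region
    staging_bucket="gs://my-staging-bucket",   # placeholder staging bucket
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="pytorch-custom-training",
    container_uri="us-docker.pkg.dev/my-project/my-repo/trainer:latest",
)

# Worker provisioning happens inside run(); this is where the ~10 minute
# wait occurs before the container actually starts training.
job.run(
    replica_count=1,
    machine_type="a2-highgpu-1g",
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
    sync=True,
)
```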

  • What are the requirements of your training cluster? GPUs? TPUs? Number of CPUs? And how long does your training job take? – guillaume blaquiere May 30 '21 at 16:12
  • I tried to run with GPUs (specifically an a2-highgpu-1g machine with 1x A100). The actual training job takes 40 minutes, plus 10 minutes of provisioning time. – Junseong Kim Jun 01 '21 at 01:28
  • Without a GPU, it takes about 3 minutes to start; I assume it's longer with a GPU, and I think there is no way to speed up the process. In your case, the warmup takes 20% of your training job, which is huge. Did you try to set up the same VM on Compute Engine to see whether provisioning is quicker there? If so, you can start your container at VM startup (startup script) and run your training directly on Compute Engine instead of Vertex AI (see the sketch after these comments), but it will require more technical/IaaS skills than Vertex AI. – guillaume blaquiere Jun 01 '21 at 07:07
  • What is your code structure (Python code, or maybe a custom container image)? How are you loading data (is it in Cloud Storage, or stored with your code)? Are you using any library? – PjoterS Jun 02 '21 at 08:11
  • I used a custom container based on a prebuilt PyTorch image. I used GCS for data loading, and the code is contained in the Docker image. – Junseong Kim Jun 04 '21 at 06:51
  • Many things might cause this; it depends on how much information you can share. What is the location of the nodes and the utilization of the GPUs? Which region and which machine were used for training? See [Pre-built containers for custom training](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers). – PjoterS Jun 07 '21 at 09:51
  • @PjoterS I used the us-central1-b location for the custom training job. Utilization is about 80%. – Junseong Kim Jun 07 '21 at 13:31
  • Honestly I doubt that it can be reduced, as per [Streamline your ML training workflow with Vertex AI](https://cloud.google.com/blog/topics/developers-practitioners/streamline-your-ml-training-workflow-vertex-ai): `The training job will automatically provision computing resources, and de-provision those resources when the job is complete. There is no worrying about leaving a high-performance virtual machine configuration running.` However, if you want to be 100% sure, you could create an issue for the Google team using the [Issue Tracker](https://issuetracker.google.com). – PjoterS Jun 09 '21 at 09:21
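
Below is a minimal sketch of the Compute Engine alternative suggested in the comments, assuming the google-cloud-compute client library. The project, zone, boot image, and container URI are placeholders, and the boot image is assumed to already have the NVIDIA driver and Docker available:

```python
# Sketch: create an a2-highgpu-1g VM whose startup script runs the training
# container, so the VM is provisioned once instead of per training job.
# Project, zone, boot image, and container names are placeholders.
from google.cloud import compute_v1

PROJECT = "my-project"
ZONE = "us-central1-b"

# Placeholder container URI; assumes the image has drivers/Docker preinstalled.
STARTUP_SCRIPT = """#! /bin/bash
docker run --gpus all us-docker.pkg.dev/my-project/my-repo/trainer:latest
"""

instance = compute_v1.Instance(
    name="pytorch-training-vm",
    machine_type=f"zones/{ZONE}/machineTypes/a2-highgpu-1g",  # 1x A100 attached
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                # Placeholder: any GPU-ready boot image family works here.
                source_image="projects/deeplearning-platform-release/global/images/family/common-cu110",
                disk_size_gb=200,
            ),
        )
    ],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    metadata=compute_v1.Metadata(
        items=[compute_v1.Items(key="startup-script", value=STARTUP_SCRIPT)]
    ),
    # GPU VMs cannot be live-migrated, so terminate on host maintenance.
    scheduling=compute_v1.Scheduling(on_host_maintenance="TERMINATE"),
)

client = compute_v1.InstancesClient()
operation = client.insert(project=PROJECT, zone=ZONE, instance_resource=instance)
operation.result()  # wait for the VM to be created
```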

0 Answers