I'm currently running jobs on Vertex AI and ran into the following error:

"error": {
    "code": 429,
    "message": "The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_nvidia_p4_gpus",
    "status": "RESOURCE_EXHAUSTED"
  }

I got this error last Friday, and on Monday it worked again. Since then I have run 8 jobs, and now the error is back.

I read Google's documentation on quotas and checked the Quotas page under IAM & Admin, but I didn't really understand it. It didn't look like I had exceeded anything. Could someone explain how quotas work?

kellya

1 Answer

The quota aiplatform.googleapis.com/custom_model_training_nvidia_p4_gpus appears to correspond to "Number of concurrent P4 GPUs for training, per region" in the Vertex AI quotas doc. As I understand it, this is a limit on concurrency, not on the total number of jobs you run: at any given time, the training jobs running in a region cannot together use more P4 GPUs than the quota allows. For example, if you're training in us-central1, which has a default quota limit of 6 for P4 GPUs, all your currently running training jobs combined cannot use more than 6 P4 GPUs.
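To make the arithmetic concrete, here is a minimal sketch of how a concurrent quota behaves. It does not call any Google API; the `RunningJob` class, the job names, and the GPU counts are all made up for illustration, and the limit of 6 is just the example figure from above.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    """Hypothetical record of a training job that is currently running."""
    name: str
    gpus_per_replica: int
    replica_count: int = 1


def gpus_in_use(jobs):
    """Total GPUs held by all running jobs (GPUs per replica x replicas)."""
    return sum(job.gpus_per_replica * job.replica_count for job in jobs)


def fits_in_quota(jobs, requested_gpus, quota_limit):
    """True if a new job asking for `requested_gpus` stays within the limit."""
    return gpus_in_use(jobs) + requested_gpus <= quota_limit


# Illustrative numbers only: three jobs holding 2 + 2 + 1 = 5 P4 GPUs.
running = [
    RunningJob("job-a", gpus_per_replica=2),
    RunningJob("job-b", gpus_per_replica=1, replica_count=2),
    RunningJob("job-c", gpus_per_replica=1),
]

print(gpus_in_use(running))         # 5
print(fits_in_quota(running, 1, 6))  # True: 5 + 1 <= 6
print(fits_in_quota(running, 2, 6))  # False: 5 + 2 > 6, so the request would be rejected
```

Once any of the running jobs finishes, its GPUs stop counting toward the total, which is why the same submission can fail one day and succeed later.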

Some options to address this:

  • You can wait for the training jobs to finish, which will free up the quota (this is likely why it worked again on that Monday after not working on the previous Friday).
  • You can select a different accelerator type for your training, since different accelerator types have different quotas.
  • You can train in another region that has quota for P4 GPUs. Note that the resulting model will live in whatever region you train in, which may or may not be an issue for you.
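If you go with the first option, waiting can be automated. The following is only a sketch of a retry-with-backoff loop: `QuotaExceededError` is a hypothetical stand-in for whatever your client raises on a 429 RESOURCE_EXHAUSTED response, and `submit` is any function of yours that launches the job. The delay values are arbitrary.

```python
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for a 429 RESOURCE_EXHAUSTED failure."""


def submit_with_retry(submit, max_attempts=5, initial_delay_s=60.0,
                      backoff=2.0, sleep=time.sleep):
    """Call `submit()` until it succeeds or `max_attempts` is reached.

    Waits between attempts, multiplying the delay by `backoff` each time.
    Any exception other than QuotaExceededError propagates immediately.
    """
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except QuotaExceededError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the quota error to the caller.
            sleep(delay)
            delay *= backoff


# Demo with a fake submit function that fails twice, then succeeds.
attempts = {"n": 0}


def flaky_submit():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise QuotaExceededError("P4 GPU quota exhausted")
    return "submitted"


waits = []  # Record the delays instead of actually sleeping.
result = submit_with_retry(flaky_submit, sleep=waits.append)
print(result, waits)  # submitted [60.0, 120.0]
```

In real code you would leave `sleep` at its default and catch the exception your client library actually raises. With the Google API client libraries that is typically `google.api_core.exceptions.ResourceExhausted`, but check what your submission path throws.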
JankyJ