
I'm running a job on AI Platform and it has been running for over an hour with no progress, no results, and almost no logs (only a few entries showing that it's running).

Here are the region, machine type, and GPUs I was using:

  "region": "us-central1",
  "runtimeVersion": "2.2",
  "pythonVersion": "3.7",
  "masterConfig": {
    "acceleratorConfig": {
      "count": "8",
      "type": "NVIDIA_TESLA_K80"
    }
  }
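
For context, a job with this config can be submitted through the AI Platform Training API, e.g. with the google-api-python-client library. In the sketch below the project ID, bucket, module name, and the n1-highmem-16 machine type are placeholders, not necessarily my exact setup:

    from googleapiclient import discovery

    project_id = "my-project"        # placeholder
    job_id = "my_training_job_01"    # placeholder

    job_spec = {
        "jobId": job_id,
        "trainingInput": {
            "scaleTier": "CUSTOM",
            "masterType": "n1-highmem-16",   # machine type backing the 8 GPUs (placeholder)
            "masterConfig": {
                "acceleratorConfig": {"count": "8", "type": "NVIDIA_TESLA_K80"},
            },
            "region": "us-central1",
            "runtimeVersion": "2.2",
            "pythonVersion": "3.7",
            "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],  # placeholder
            "pythonModule": "trainer.task",                        # placeholder
        },
    }

    # Submit the job via the AI Platform Training and Prediction API (service "ml", v1).
    ml = discovery.build("ml", "v1")
    response = ml.projects().jobs().create(
        parent=f"projects/{project_id}", body=job_spec
    ).execute()
    print(response)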

[screenshot: the AI Platform job]

[screenshot: only a few logs for this job]

The model I'm training is big and uses a lot of memory. The job is just hanging there without any progress, logs, or errors, but I notice it has consumed 12.81 ML units on GCP. Normally, if a GPU runs out of memory, the job throws an OOM/ResourceExhausted error. Without logs, I have no idea what's going wrong.
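
In case it helps with debugging, below is a minimal sketch of the kind of explicit logging I could add to the trainer so that something shows up in Cloud Logging; the callback name and log interval are arbitrary:

    import logging
    import tensorflow as tf

    logging.getLogger().setLevel(logging.INFO)

    class ProgressLogger(tf.keras.callbacks.Callback):
        """Logs every N batches so the AI Platform job shows signs of life."""

        def __init__(self, every_n_batches=50):
            super().__init__()
            self.every_n_batches = every_n_batches

        def on_train_batch_end(self, batch, logs=None):
            if batch % self.every_n_batches == 0:
                logging.info("batch %d, loss %.4f",
                             batch, (logs or {}).get("loss", float("nan")))

        def on_epoch_end(self, epoch, logs=None):
            logging.info("finished epoch %d: %s", epoch, logs)

    # Very verbose, but confirms whether ops are actually being placed on the GPUs.
    tf.debugging.set_log_device_placement(True)

    # model.fit(dataset, epochs=..., callbacks=[ProgressLogger()])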

I ran a different job with a smaller input dimension and it completed successfully in 12 minutes:

[screenshot: the successful job]

Also, I use tf.distribute.MirroredStrategy for training so that it distributes across the GPUs.
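
Roughly how the strategy is wired up; the model below is a simplified placeholder (my real model is much larger) and the batch sizes are just examples:

    import tensorflow as tf

    # MirroredStrategy replicates the model onto every visible GPU (8 x K80 here)
    # and splits each global batch across the replicas.
    strategy = tf.distribute.MirroredStrategy()
    print("Number of replicas:", strategy.num_replicas_in_sync)

    GLOBAL_BATCH_SIZE = 64 * strategy.num_replicas_in_sync  # per-replica batch of 64

    with strategy.scope():
        # Placeholder model -- the real one is much bigger.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(1024, activation="relu", input_shape=(4096,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )

    # The dataset is batched with the *global* batch size; the strategy shards it.
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal([1024, 4096]),
         tf.random.uniform([1024], maxval=10, dtype=tf.int32))
    ).batch(GLOBAL_BATCH_SIZE)

    model.fit(dataset, epochs=2)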

Any thoughts on this?

  • Hello, is this still happening to you? Can you share a sample of the code so we understand what it is doing? – Fernando C. Nov 18 '20 at 12:02
  • The same issue happens for me under the very same conditions: GPUs, MirroredStrategy. Units are consumed. If I look at the master CPU utilization, I can see that it is at 0%; GPU utilization is at 20% for a few minutes, then no more utilization appears on the graph (not a utilization of 0, the line on the graph is simply broken). No training occurs. – Patrick May 05 '21 at 15:42
  • There is, though, activity visible on the Network panel. The speed of sent bytes is very low (<2,000/s); the speed of received bytes varies between 1,000 and 5,000/s. I'll add that the behavior described above seems to happen only when I start training from a previously saved model (also trained with a distributed strategy). I can't believe that loading a 300 MB model can be the explanation (the job has been running for more than 3 hours). – Patrick May 05 '21 at 15:50
  • Could it be related to this issue: https://github.com/tensorflow/tensorflow/issues/46146 ("Model loaded from a SavedModel format in a distribution strategy has weights whose names are not unique") ? – Patrick May 05 '21 at 15:51
  • Update: I've reduced the batch size and it's working. It's very annoying, though, that I have to pay for a job that ran for over an hour without progress... – Patrick May 05 '21 at 17:15

0 Answers