AI Platform Pipelines fails randomly and intermittently

Question

I've been using AI Platform Pipelines (v0.2.5) for several months. I rebuilt the Pipelines instance after finding a newer version (v0.5.1) in the Console. Since then, I can't get my pipelines to run to completion.

It's very strange because there doesn't seem to be any pattern to the failures.

Pods (components) fail randomly. Most pods complete successfully, while some fail, and which pods fail varies from one execution to the next.
The failed pods report one of the two error messages below, seemingly at random.

google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. 
Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. 
For more information, please see https://cloud.google.com/docs/authentication/getting-started

File "", line 3, in raise_from google.auth.exceptions.RefreshError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 500 Response:\nb'Could not recursively fetch uri\n'", <google.auth.transport.requests._Response object at 0x7fe5729c9650>)

Workload Identity is set up on the GKE cluster. I have double-checked the procedure and the settings and can't find a problem. Although some pods fail, the other pods run successfully with Workload Identity. The Google Cloud Credentials API is also enabled.
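
If the failures turn out to be transient (for example, the metadata server not answering yet when a pod starts, which is only a guess here), one workaround would be to retry the credential lookup with backoff inside the component. The following is a minimal sketch under that assumption, not a confirmed fix. `call_with_retry` and `load_default_credentials` are hypothetical helper names; `google.auth.default` and the two exception classes are the ones that appear in the tracebacks above.

```python
import random
import time


def call_with_retry(fetch, attempts=5, base_delay=1.0,
                    retriable=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            # Exponential backoff with jitter: roughly 1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


def load_default_credentials():
    """Hypothetical wrapper: fetch Application Default Credentials with retries."""
    # Imported lazily so call_with_retry itself has no third-party dependency.
    import google.auth
    from google.auth import exceptions

    # google.auth.default() returns a (credentials, project_id) tuple.
    return call_with_retry(
        google.auth.default,
        retriable=(exceptions.DefaultCredentialsError, exceptions.RefreshError),
    )
```

This only masks the symptom. If the same pods still fail after several retries, the cause is more likely configuration than timing.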

I don't know whether these problems were caused by updating the Pipelines instance.

Any ideas?

Hey @oguogura, can you verify that all of your pods are running on nodes from the same node pool? It seems to me the failed pods are running on a different node pool than the successful pods. - mdtp, Jul 22 '20 at 16:29
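
One rough way to check the node pool question is to count pod phases per node pool from `kubectl` JSON output. This is a sketch that assumes the nodes carry the standard GKE label `cloud.google.com/gke-nodepool`. The function name `phases_by_node_pool` and the file names in the comments are made up for illustration.

```python
import json
from collections import Counter, defaultdict

# Label that GKE sets on each node to name its node pool.
NODE_POOL_LABEL = "cloud.google.com/gke-nodepool"


def phases_by_node_pool(pods, nodes):
    """Count pod phases (Succeeded, Failed, ...) per node pool.

    pods and nodes are the parsed outputs of
    `kubectl get pods -o json` and `kubectl get nodes -o json`.
    """
    pool_of_node = {
        node["metadata"]["name"]:
            node["metadata"].get("labels", {}).get(NODE_POOL_LABEL, "unknown")
        for node in nodes["items"]
    }
    counts = defaultdict(Counter)
    for pod in pods["items"]:
        # Pods that were never scheduled have no nodeName.
        node_name = pod["spec"].get("nodeName")
        pool = pool_of_node.get(node_name, "unknown")
        counts[pool][pod["status"].get("phase", "Unknown")] += 1
    return {pool: dict(phases) for pool, phases in counts.items()}


# Example usage (file names are placeholders):
#   kubectl get pods -n <namespace> -o json > pods.json
#   kubectl get nodes -o json > nodes.json
#
#   with open("pods.json") as p, open("nodes.json") as n:
#       print(phases_by_node_pool(json.load(p), json.load(n)))
```

If every `Failed` pod lands in one pool and every `Succeeded` pod in another, that pool's Workload Identity settings are the first thing to compare.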
