
I am using Dockerflow to run parallel tasks through the Google Pipelines API on Google Cloud Platform. I started a single-step task running 1389 VMs in parallel and found that 233 of the VMs were apparently doing nothing and hanging indefinitely.

I did a spot check of the serial console output and repeatedly saw the VMs running into "Getting controller config failed" errors.
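
For anyone wanting to check the same thing, serial console output like the snippet below can be fetched with the gcloud CLI (the instance name and zone here are placeholders):

gcloud compute instances get-serial-port-output <instance-name> --zone <zone>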

When I tried logging into the VMs I received the error: "Connection Failed. We are unable to connect to the VM on port 22".

I am wondering why my VM instances are hanging, and whether there is something I can do to avoid running into these issues.

I've included a snippet of the serial console output below:

startupscript: +++ readlink -f /usr/share/google-genomics/startup.sh
startupscript: ++ dirname /usr/share/google-genomics/startup.sh
startupscript: + cd /usr/share/google-genomics
startupscript: + ./controller --operation_id <id> --validation_token <token> --base_path https://genomics.googleapis.com
create controller[2905]: Getting controller config
create controller[2905]: Getting controller config failed, will retry: Get <link>: Get <service_account_token_link>: net/http: timeout awaiting response headers
create controller[2905]: Getting controller config failed, will retry: Get <link>: dial tcp 74.125.26.95:443: i/o timeout
collectd[2342]: write_gcm: Asking metadata server for auth token
collectd[2342]: write_gcm: curl_easy_perform() failed: Couldn't connect to server
collectd[2342]: write_gcm: Error -1 from wg_curl_get_or_post
collectd[2342]: write_gcm: wg_transmit_unique_segment failed.
collectd[2342]: write_gcm: wg_transmit_unique_segments failed. Flushing.

2 Answers


There was a temporary networking issue in us-east1-b. All three of the failed VMs you named were in us-east1-b. These minor incidents do not appear on https://status.cloud.google.com/

Serial console output for a successful run looks like:

Feb 21 19:05:06 ggp-5629907348021283130 startupscript: + ./controller --operation_id --validation_token --base_path https://autopush-genomics.sandbox.googleapis.com
Feb 21 19:05:06 ggp-5629907348021283130 create controller[2689]: Getting controller config
Feb 21 19:05:36 ggp-5629907348021283130 create controller[2689]: Getting controller config failed, will retry: Get https://genomics.googleapis.com/v1alpha2/pipelines:getControllerConfig?alt=json&operationId=&validationToken=: dial tcp 173.194.212.81:443: i/o timeout
Feb 21 19:05:43 ggp-5629907348021283130 controller[2689]: Switching to status: pulling-image
Feb 21 19:05:43 ggp-5629907348021283130 controller[2689]: Calling SetOperationStatus(pulling-image)
Feb 21 19:05:44 ggp-5629907348021283130 controller[2689]: SetOperationStatus(pulling-image) succeeded

The "Getting controller config failed, will retry" is fine. It succeeded upon retry. The "SetOperationStatus(pulling-image) succeeded" indicates networking is working.

In theory, you can submit any number of jobs to the Pipelines API and the API will take care of queueing them.

If these temporary networking hiccups become common, we may consider changing the Pipelines API to detect and retry them.
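
In the meantime, one client-side workaround is to poll each operation and cancel/resubmit it if it never finishes. The sketch below is illustrative only: it assumes the gcloud alpha genomics command surface current at the time, a hypothetical pipeline definition file my-pipeline.yaml, and an arbitrary one-hour deadline.

#!/bin/bash
# Sketch: watch one Pipelines API operation; if it does not finish within
# DEADLINE_SECS, cancel it and resubmit the pipeline definition.
OPERATION_ID="$1"                    # operation id returned at submission
PIPELINE_FILE="my-pipeline.yaml"     # hypothetical pipeline definition
DEADLINE_SECS=3600
POLL_SECS=60
elapsed=0
while true; do
  done_flag=$(gcloud alpha genomics operations describe "$OPERATION_ID" --format='value(done)')
  if [ "$done_flag" = "True" ]; then
    echo "Operation $OPERATION_ID finished."
    break
  fi
  if [ "$elapsed" -ge "$DEADLINE_SECS" ]; then
    echo "Operation $OPERATION_ID appears hung; cancelling and resubmitting."
    gcloud alpha genomics operations cancel "$OPERATION_ID" --quiet
    gcloud alpha genomics pipelines run --pipeline-file "$PIPELINE_FILE"
    break
  fi
  sleep "$POLL_SECS"
  elapsed=$((elapsed + POLL_SECS))
done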

Melissa

There may have been a temporary networking issue. Can you give me some failed operation IDs (or failed VM names)?

Have you tried again since then? Can you reproduce the problem?

Melissa
Hi Melissa, thanks for your response! Yes, here are a few of the failed VM names: ggp-10216049259697508221, ggp-10257299594135474280, ggp-1028157029596421767. I tried again, just running the batch of 233 failed jobs, and they all completed successfully. Thinking about it more, it looks like the VMs encountered errors getting data from the Google Genomics API server (https://genomics.googleapis.com/v1alpha2/). I'm thinking I may have just overloaded it by submitting 1000+ jobs at once. – Paul Billing-Ross Feb 16 '17 at 20:53
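
If overloading the API with simultaneous submissions was indeed a factor, throttling the submissions is one way to soften it. A minimal sketch, assuming the same gcloud alpha genomics surface as above and hypothetical pipeline definition files listed one per line in pipelines.txt:

#!/bin/bash
# Sketch: submit pipelines in batches of BATCH_SIZE, pausing between batches
# so that 1000+ requests are not issued at once.
BATCH_SIZE=100
PAUSE_SECS=60
count=0
while read -r pipeline_file; do
  gcloud alpha genomics pipelines run --pipeline-file "$pipeline_file"
  count=$((count + 1))
  if [ $((count % BATCH_SIZE)) -eq 0 ]; then
    sleep "$PAUSE_SECS"
  fi
done < pipelines.txt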