
I am attempting to run a Training Job on Google's Cloud ML. The signs that I have of my job running are:

  • Messages such as these indicating the package was built and installed:

INFO 2017-06-07 15:14:01 -0700 master-replica-0 Successfully built training-job-foo

INFO 2017-06-07 15:14:01 -0700 master-replica-0 Installing collected packages: training-job-foo

INFO 2017-06-07 15:14:01 -0700 master-replica-0 Successfully installed training-job-foo-0.1.dev0

INFO 2017-06-07 15:14:01 -0700 master-replica-0 Running command: pip install --user training-job-foo-0.1.dev0.tar.gz

INFO 2017-06-07 15:14:02 -0700 master-replica-0 Processing ./training-job-foo-0.1.dev0.tar.gz

  • Messages like this indicating that my job is starting:

INFO 2017-06-07 15:14:03 -0700 master-replica-0 Running command: python -m training-job-foo.training_routine_bar --job-dir gs://regional-bucket-similar-to-training-job/output/

  • A message like this indicating that my scalar summaries are being processed:

INFO 2017-06-07 15:14:21 -0700 master-replica-0 Summary name Total Accuracy is illegal; using Total_Accuracy instead.

  • Finally, I also see CPU and memory usage increase, and my consumedMLUnits increase.

  • I should add that I also see the summary FileWriters create the summary files before the jobs are created, but I don't see those files increase in size. I also see an initial checkpoint file written to gs://regional-bucket-similar-to-training-job/output/

Other than that, I see no further logs or output. I should be seeing logs, since I print accuracy and loss every so often. I also write summaries and checkpoint files.
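A minimal sketch of the kind of periodic reporting described above, to make the setup concrete. This is not the actual training code: the helper names `make_logger` and `report_progress`, the reporting interval, and the explicit flush after each message are all illustrative assumptions. The flush is included only because buffered output is one possible reason for messages not showing up promptly; whether that applies here is not known.

```python
import logging
import sys


def make_logger(name="training_routine_bar", stream=None):
    """Build a logger that writes INFO messages to a stream (stdout by default).

    Hypothetical helper for illustration; not part of the training package.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = []  # avoid duplicate handlers on repeated calls
    target = stream if stream is not None else sys.stdout
    handler = logging.StreamHandler(target)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def report_progress(logger, step, accuracy, loss, every=100):
    """Log metrics every `every` steps; return True if a line was emitted."""
    if step % every != 0:
        return False
    logger.info("step=%d accuracy=%.4f loss=%.4f", step, accuracy, loss)
    # Flush explicitly so the message is not held in a buffer (an assumption
    # about a possible cause, not a confirmed diagnosis).
    for handler in logger.handlers:
        handler.flush()
    return True


if __name__ == "__main__":
    log = make_logger()
    # Placeholder metric values; a real loop would pass the computed ones.
    for step in range(0, 301):
        report_progress(log, step, accuracy=0.5, loss=1.0)
```

In a real training loop, `report_progress` would be called once per step with the values computed for that step.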

What am I missing?

Also, what other debugging tools are available in such scenarios? All I am doing currently is streaming logs, watching the job status, CPU usage, and memory usage on the Cloud ML console, and watching my Cloud Storage bucket for any changes.

7hacker

1 Answer


Sorry that you are experiencing issues. Currently, the available debugging tools are job logs, metrics, and TensorBoard, but it seems that none of these is helping in your case. If possible, could you please send your project number and job ID to cloudml-feedback@google.com so that we can take a closer look?

Guoqing Xu