I am attempting to run a training job on Google's Cloud ML. The signs I have that my job is running are:
- Messages such as these indicating the package was built and installed:
INFO 2017-06-07 15:14:01 -0700 master-replica-0 Successfully built training-job-foo
INFO 2017-06-07 15:14:01 -0700 master-replica-0 Installing collected packages: training-job-foo
INFO 2017-06-07 15:14:01 -0700 master-replica-0 Successfully installed training-job-foo-0.1.dev0
INFO 2017-06-07 15:14:01 -0700 master-replica-0 Running command: pip install --user training-job-foo-0.1.dev0.tar.gz
INFO 2017-06-07 15:14:02 -0700 master-replica-0 Processing ./training-job-foo-0.1.dev0.tar.gz
- A message like this indicating that my job is starting:
INFO 2017-06-07 15:14:03 -0700 master-replica-0 Running command: python -m training-job-foo.training_routine_bar --job-dir gs://regional-bucket-similar-to-training-job/output/
- A message like this indicating that my scalar summaries are being processed:
INFO 2017-06-07 15:14:21 -0700 master-replica-0 Summary name Total Accuracy is illegal; using Total_Accuracy instead.
Finally, I also see CPU and memory usage increase, and my consumedMLUnits increase.
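Regarding the "Summary name Total Accuracy is illegal" warning: that summary is declared roughly like the sketch below (simplified; the placeholder tensor stands in for whatever training_routine_bar actually computes). As far as I can tell the warning is cosmetic, since the runtime substitutes Total_Accuracy anyway, but I could rename the summary to silence it:

```
import tensorflow as tf

# Placeholder standing in for the real accuracy computation
# inside training_routine_bar.
total_accuracy = tf.placeholder(tf.float32, name='total_accuracy')

# The space in the name triggers the "Summary name ... is illegal"
# warning; TensorFlow substitutes 'Total_Accuracy' and carries on,
# so renaming it here would only silence the warning.
tf.summary.scalar('Total Accuracy', total_accuracy)

merged_summaries = tf.summary.merge_all()
```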
I should add that I also see the summary FileWriters create the summary files before training starts, but I don't see those files increase in size. I also see an initial checkpoint file written to gs://regional-bucket-similar-to-training-job/output/.
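In case FileWriter buffering is relevant: my training loop is shaped roughly like the sketch below (a toy graph stands in for my real model, and the step counts are made up). The event file is created as soon as the FileWriter is constructed, but as I understand it, added events are buffered and only written out periodically or on flush()/close(), so a file that stays at its initial size might just mean the loop never reaches add_summary:

```
import tensorflow as tf

# Same output path as the --job-dir above.
output_dir = 'gs://regional-bucket-similar-to-training-job/output/'

# Toy graph standing in for the real model.
x = tf.Variable(0.0, name='x')
loss = tf.square(x - 3.0)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
tf.summary.scalar('loss', loss)
merged_summaries = tf.summary.merge_all()
saver = tf.train.Saver()

with tf.Session() as sess:
    # Constructing the FileWriter creates the (still empty) event file.
    writer = tf.summary.FileWriter(output_dir, sess.graph)
    sess.run(tf.global_variables_initializer())

    for step in range(1000):
        _, summary = sess.run([train_op, merged_summaries])
        if step % 100 == 0:
            writer.add_summary(summary, step)
            # Events are buffered and written out periodically; an explicit
            # flush() makes them visible in the bucket (and to TensorBoard)
            # without waiting for close().
            writer.flush()
            saver.save(sess, output_dir + 'model.ckpt', global_step=step)

    writer.close()
```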
Other than that, I see no further logs or output. I should be seeing logs, since I print accuracy and loss every so often, and I also write summaries and checkpoint files.
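I report accuracy and loss with plain print() calls at the moment. Since stdout is block-buffered when it is not attached to a terminal, I am wondering whether those lines are simply sitting in a buffer, and whether switching to the logging module / tf.logging (which surfaces as INFO entries in the streamed logs) is the right fix, roughly:

```
import tensorflow as tf

# tf.logging wraps the standard logging module; INFO-level records
# show up alongside the framework's own master-replica-0 entries
# in the streamed job logs.
tf.logging.set_verbosity(tf.logging.INFO)

def report(step, loss_value, accuracy_value):
    tf.logging.info('step %d: loss=%.4f accuracy=%.4f',
                    step, loss_value, accuracy_value)
```

(Alternatively, an explicit sys.stdout.flush() after each print, or running the interpreter with -u, should force any buffered output out.)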
What am I missing?
Also, what other debugging tools are available in such scenarios? All I am doing currently is streaming the logs, watching the job status, CPU usage, and memory usage in the Cloud ML console, and watching my Cloud Storage bucket for any changes.
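One extra check I have thought of, assuming TF 1.x's tf.gfile and credentials that can read the bucket: poll the output directory programmatically and watch whether the event and checkpoint files actually grow, instead of eyeballing the bucket in the console:

```
import tensorflow as tf

# Same output path the job writes to.
output_dir = 'gs://regional-bucket-similar-to-training-job/output/'

# tf.gfile understands gs:// paths, so this can run locally while the
# job is still training.
for name in tf.gfile.ListDirectory(output_dir):
    stat = tf.gfile.Stat(output_dir + name)
    print('%s\t%d bytes' % (name, stat.length))
```

TensorBoard can also be pointed straight at the gs:// output directory, though presumably it would just show the same stalled event files for now.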