Can anyone help me identify the "bug" in my Google Cloud ML training job?

Question

I was following the guide below to replicate the process with new data and a new model:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

When I reached the last step, I submitted the training job with the command below:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--runtime-version 1.4 \
--job-dir=gs://marksbucket0000/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-east1 \
--config /Users/markli/Desktop/chase_ad_object/project_2/cluster_config/cloud.yml \
-- \
--train_dir=gs://marksbucket0000/train \
--pipeline_config_path=gs://marksbucket0000/data/ssd_mobilenet_v1_coco.config

The job seems to start successfully:

Job [xxx_object_detection_xxxxxxx] submitted successfully.
Your job is still active. You may view the status of your job with the command

$ gcloud ml-engine jobs describe xxx_object_detection_xxxxxxx

or continue streaming the logs with the command

However, it stops due to the following errors in the log:

Since I am extremely new to Google Cloud ML and the TensorFlow Object Detection API, I couldn't find a clue in the log regarding which step I did wrong.

The YML cluster configuration file I was using is:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

I would really appreciate it if anyone could at least point me in a direction for debugging. Thanks so much in advance!

---------------- Update on the question --------------

I actually got it working by changing setup.py as below:

"""Setup script for object_detection."""

from setuptools import find_packages
from setuptools import setup


# Original list, kept for reference:
# REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']
# Extended list that got the job running:
REQUIRED_PACKAGES = ['Tensorflow>=1.4.0', 'Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1', 'Jupyter']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
)

Though I ran into some "no module found" issues when running the training job, many online discussions quickly identify the solution, so I am not repeating them here.

However, I did get stuck on an issue when running the evaluation job, "cannot import pycocotool", for which I found the solution here: https://github.com/tensorflow/models/issues/3470

Now both my training and evaluation jobs are up and running. However, it seems strange that I can't see any statistics (e.g. a loss plot in orange) for the evaluation job on TensorBoard's scalar display, even though the eval job check-box does show up as a view option:

I have also checked the log of the eval job and found that the node seems to be constantly skipping images. Is this the cause of the issue? Maybe there is some issue with the evaluation dataset?

Log info in eval job:

score 0 · Answer 1 · answered Jun 17 '18 at 16:27


Parallel interleave functionality is available only in TensorFlow 1.5+. Try changing the line in your YAML to:

runtimeVersion: "1.8"
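
To confirm which TensorFlow version the job actually runs with, a check along these lines could be logged at the start of training. This is only a minimal sketch: the helper name `meets_minimum` is made up for illustration, and it assumes the version string begins with numeric major and minor parts, as `tf.__version__` does.

```python
def meets_minimum(version, minimum="1.5"):
    """Return True if a dotted version string is at least `minimum`.

    Only the major and minor parts are compared, so "1.8.0" and
    "1.8.0-rc1" are both treated as (1, 8).
    """
    def parse(v):
        # "1.4.0" -> (1, 4); integer tuples compare correctly even for
        # two-digit minors such as "1.10".
        return tuple(int(part) for part in v.split(".")[:2])

    return parse(version) >= parse(minimum)


if __name__ == "__main__":
    # In the training module this would be tf.__version__ instead of the
    # hard-coded strings used here for illustration.
    for candidate in ("1.4.0", "1.8.0"):
        print(candidate, "supports parallel interleave:", meets_minimum(candidate))
```

Comparing integer tuples rather than raw strings avoids the pitfall where "1.10" sorts before "1.5" lexicographically.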

answered Jun 17 '18 at 16:27

Lak


Thanks for the answer. I tried simply using a higher runtime version like 1.8, but it doesn't solve the issue. It's actually my fault that I didn't include the termination reason: "no data module found". However, I have fixed it by changing the required packages in the "setup.py" file. I have updated my question above with another issue I've encountered. Please let me know if you know the possible cause of it. Again, thanks so much for the help! – Mark Li Jun 17 '18 at 18:10
