I was following the guide below to replicate the process with new data and a new model:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md
When I reached the last step, I submitted the training job with the command below:
gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--runtime-version 1.4 \
--job-dir=gs://marksbucket0000/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-east1 \
--config /Users/markli/Desktop/chase_ad_object/project_2/cluster_config/cloud.yml \
-- \
--train_dir=gs://marksbucket0000/train \
--pipeline_config_path=gs://marksbucket0000/data/ssd_mobilenet_v1_coco.config
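For readers who prefer to script the submission, here is a minimal Python sketch that assembles the same argument list. The function name `build_submit_command` and its parameters are hypothetical and exist only for illustration; the flags themselves are copied from the command above. It makes one detail explicit: everything after the bare `--` is forwarded to `object_detection.train` rather than interpreted by `gcloud`.

```python
def build_submit_command(user, timestamp, bucket, config_yaml, pipeline_config):
    """Return the argv list equivalent to the gcloud command shown above.

    All parameter names are illustrative; only the flags come from the post.
    """
    # Mirrors `whoami`_object_detection_`date +%s` from the shell version.
    job_name = "{}_object_detection_{}".format(user, timestamp)

    # Flags consumed by gcloud itself.
    gcloud_args = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--runtime-version", "1.4",
        "--job-dir={}/train".format(bucket),
        "--packages", "dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz",
        "--module-name", "object_detection.train",
        "--region", "us-east1",
        "--config", config_yaml,
    ]

    # Flags passed through to the training module after the bare "--".
    module_args = [
        "--train_dir={}/train".format(bucket),
        "--pipeline_config_path={}".format(pipeline_config),
    ]
    return gcloud_args + ["--"] + module_args
```

The resulting list could be handed to `subprocess.run(...)`, with `getpass.getuser()` and `int(time.time())` supplying the user and timestamp parts of the job name.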
The job appears to start successfully:
Job [xxx_object_detection_xxxxxxx] submitted successfully.
Your job is still active. You may view the status of your job with the command
$ gcloud ml-engine jobs describe xxx_object_detection_xxxxxxx
or continue streaming the logs with the command
However, it stops due to the following errors in the log:
Since I am extremely new to Google Cloud ML and the TensorFlow Object Detection API, I couldn't find a clue in the log regarding which step I got wrong.
The YML cluster configuration file I was using is:
trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
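As a quick sanity check on a cluster file like the one above, the following Python sketch tests the fields against two rules. These rules are an assumption based on my reading of the ML Engine documentation and should be verified there: a `CUSTOM` scale tier needs a `masterType`, and every nonzero worker or parameter server count needs a matching machine type. The helper `check_training_input` is hypothetical, and the config is written as a plain dict so that no YAML library is required.

```python
def check_training_input(training_input):
    """Return a list of problems found in a trainingInput mapping.

    The rules checked here are assumptions, not an official validator.
    """
    problems = []
    if training_input.get("scaleTier") != "CUSTOM":
        # Predefined tiers pick machine types themselves; nothing to check.
        return problems

    if not training_input.get("masterType"):
        problems.append("scaleTier CUSTOM requires masterType")

    # Each nonzero count must come with a machine type.
    pairs = (
        ("workerCount", "workerType"),
        ("parameterServerCount", "parameterServerType"),
    )
    for count_key, type_key in pairs:
        count = int(training_input.get(count_key, 0))
        if count > 0 and not training_input.get(type_key):
            problems.append("{} is {} but {} is missing".format(count_key, count, type_key))
    return problems


# The configuration from this post, transcribed as a dict.
CONFIG = {
    "trainingInput": {
        "runtimeVersion": "1.4",
        "scaleTier": "CUSTOM",
        "masterType": "standard_gpu",
        "workerCount": 5,
        "workerType": "standard_gpu",
        "parameterServerCount": 3,
        "parameterServerType": "standard",
    }
}
```

Under these assumed rules, `check_training_input(CONFIG["trainingInput"])` returns an empty list, which suggests the cluster file itself is not the source of the failure.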
I would really appreciate it if anyone could at least point me in a direction to debug this. Thanks so much in advance!
---------------- Update on the question --------------
I actually got it working by changing setup.py as below:
"""Setup script for object_detection."""
from setuptools import find_packages
from setuptools import setup

# REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']
REQUIRED_PACKAGES = ['Tensorflow>=1.4.0', 'Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1', 'Jupyter']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
)
Although I ran into some "no module found" errors when running the training job, there are plenty of online discussions that quickly point to the solution, so I am not repeating them here.
However, I did get stuck on an issue when running the evaluation job, "cannot import pycocotool", for which I found the solution here: https://github.com/tensorflow/models/issues/3470
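Import errors like these can often be caught before a job is submitted. The Python 3 sketch below lists which modules cannot be located in the current environment. The helper `missing_modules` and the list `MODULES_TO_CHECK` are hypothetical names; the entries are the import names (not the pip names) of the packages mentioned in this post. Note that it only inspects the local interpreter, so a clean result here does not guarantee the same on the ML Engine workers.

```python
import importlib.util

# Import names for the packages discussed above: Pillow imports as PIL,
# and the COCO API imports as pycocotools.
MODULES_TO_CHECK = ["tensorflow", "PIL", "matplotlib", "Cython", "pycocotools"]


def missing_modules(names):
    """Return the top-level module names that cannot be found locally."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing:", missing_modules(MODULES_TO_CHECK))
```

Any name that is reported as missing would need to be added to the packaged dependencies, or installed as described in the linked issue, before the job can import it.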
Now both my training and evaluation jobs are up and running. However, it seems strange that no statistics (e.g. the loss plot in orange) show up for the evaluation job on TensorBoard's scalar display, even though the eval job does appear as a check-box view option in it:
I also checked the log of the eval job and found that the node seems to be constantly skipping images. Is this the cause of the issue? Could there be a problem with the evaluation dataset?
Log info in eval job: