I've successfully trained models with TensorFlow's Object Detection API, both locally on a GPU (using model_main.py) and on Google's ML Engine (with both GPUs and TPUs). However, I can't get model_tpu_main.py to train a model when running on Google Cloud with a manually provisioned VM and TPU.

When I launch model_tpu_main.py with a command like python -m object_detection.model_tpu_main --model_dir=gs://bucket/training --tpu_zone us-central1-b --pipeline_config_path=gs://bucket/training/pipeline.config --job-dir gs://bucket/training --tpu_name mytpu_name, it gets stuck at:

...
W1113 03:05:16.628712 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/BatchNorm/moving_mean] is not available in checkpoint
W1113 03:05:16.629062 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/BatchNorm/moving_variance] is not available in checkpoint
W1113 03:05:16.629330 139998232708864 variables_helper.py:144] Variable [resnet_v1_50/fpn/smoothing_2/weights] is not available in checkpoint
2018-11-13 03:06:08.618834: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
...

Looking at the TPU logs, pretty much all I get is:

...
Start master session b9186abfa4e15b1d with config: isolate_session_state: true
Start master session 48b812f9ca0d3ebf with config: isolate_session_state: true
Start master session 33048226cb131f4c with config: isolate_session_state: true
Start master session cab95e277a429f9d with config: isolate_session_state: true
Start master session 56b5d3296c9bfe15 with config: isolate_session_state: true
Start master session 3fdac64b285c365d with config: isolate_session_state: true
Start master session ec1fa14806ad9351 with config: isolate_session_state: true
...

Any idea what I'm doing wrong?

michaelb
Simon Labrecque