
My goal is to test a custom object-detection training using the Google ML-Engine based on the pet-training example from the Object Detection API.

After some successful training steps (probably before the first checkpoint, since no new checkpoint had been created) ...

15:46:56.784 global step 2257: loss = 0.7767 (1.70 sec/step)

15:46:56.821 global step 2258: loss = 1.3547 (1.13 sec/step)

... I received the following error on several object detection training job attempts:

Error reported to Coordinator: , {"created":"@1502286418.246034567","description":"OS Error","errno":104,"file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":229,"grpc_status":14,"os_error":"Connection reset by peer","syscall":"recvmsg"}

I received it on worker replicas 0, 3, and 4. Afterwards the job fails:

Command '['python', '-m', u'object_detection.train', u'--train_dir=gs://cartrainingbucket/train', u'--pipeline_config_path=gs://cartrainingbucket/data/faster_rcnn_resnet101.config', '--job-dir', u'gs://cartrainingbucket/train']' returned non-zero exit status -9

I'm using an adaptation of the faster_rcnn_resnet101.config, with the following changes:

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://cartrainingbucket/data/vehicle_train.record"
  }
  label_map_path: "gs://cartrainingbucket/data/vehicle_label_map.pbtxt"
}

eval_config: {
  num_examples: 2000
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://cartrainingbucket/data/vehicle_val.record"
  }
  label_map_path: "gs://cartrainingbucket/data/vehicle_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

My bucket looks like this:

cartrainingbucket (Regional US-CENTRAL1)
--data/
  --faster_rcnn_resnet101.config
  --vehicle_label_map.pbtxt
  --vehicle_train.record
  --vehicle_val.record
--train/ 
  --checkpoint
  --events.out.tfevents.1502259105.master-556a4f538e-0-tmt52
  --events.out.tfevents.1502264231.master-d3b4c71824-0-2733w
  --events.out.tfevents.1502267118.master-7f8d859ac5-0-r5h8s
  --events.out.tfevents.1502282824.master-acb4b4f78d-0-9d1mw
  --events.out.tfevents.1502285815.master-1ef3af1094-0-lh9dx
  --graph.pbtxt
  --model.ckpt-0.data-00000-of-00001
  --model.ckpt-0.index
  --model.ckpt-0.meta
  --packages/

I run the job using the following command (from a Windows cmd, where ^ is the line-continuation character, equivalent to \ on Linux):

gcloud ml-engine jobs submit training stefan_object_detection_09_08_2017i ^
--job-dir=gs://cartrainingbucket/train ^
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz ^
--module-name object_detection.train ^
--region us-central1 ^
--config object_detection/samples/cloud/cloud.yml ^
-- ^
--train_dir=gs://cartrainingbucket/train ^
--pipeline_config_path=gs://cartrainingbucket/data/faster_rcnn_resnet101.config

The cloud.yml is the default one:

trainingInput:
  runtimeVersion: "1.0" # I also tried 1.2; in that case the failure appeared earlier in training
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

I'm using the latest TensorFlow models master branch (commit 36203f09dc257569be2fef3a950ddb2ac25dddeb). My locally installed TF version is 1.2 and I'm using Python 3.5.1.

My training and validation records both work locally for training.

As a newbie, it's hard for me to see the source of the problem. I'd be happy for any advice.

Horst Lemke
  • Your parameter server likely went down. There are techniques to be more robust to that, but we would need to verify your code. Is it available? How similar is it to the pet sample? It may also be helpful if you provide us with a dump of your logs. – rhaertel80 Aug 10 '17 at 02:29
  • We are training on aerial images that are cropped to slices of about 1000x1000 pixels. On some images there are about 50 objects. We disabled learning from a checkpoint. – Horst Lemke Aug 10 '17 at 06:05

2 Answers


Update: The job failed because it ran out of memory. Please try using larger machines instead.

In addition to rhaertel80's answer, it would also be helpful if you could share the project number and job ID with us via cloudml-feedback@google.com.
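
For what it's worth, a cloud.yml sketch with larger machines could look like the one below. The machine type names (complex_model_m_gpu for the GPU workers, large_model for the parameter servers) are standard Cloud ML Engine types, but which ones actually fit your quota and your memory footprint is an assumption you would have to verify:

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu       # more RAM than standard_gpu
  workerCount: 5
  workerType: complex_model_m_gpu
  parameterServerCount: 3
  parameterServerType: large_model      # high-memory machine for the parameter servers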

Guoqing Xu

One possibility is that the TF processes are using too many resources (usually memory) and are being killed by the OS; the exit status -9 is consistent with a SIGKILL from the kernel's OOM killer. This would explain the connection reset by peer.

So one thing to try would be a scale tier or machine types with more resources.
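
If bigger machines alone don't help, a complementary option (a sketch only, based on the faster_rcnn_resnet101 sample config; the queue field names are assumptions from the Object Detection API's train.proto) is to cap the input resolution and shrink the input queues in the pipeline config, which reduces the per-worker memory footprint for large aerial crops:

model {
  faster_rcnn {
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600   # sample default; a 1000x1000 crop is scaled down to 600x600
        max_dimension: 1024  # upper bound on the longer side after resizing
      }
    }
    # ... rest of the model config unchanged ...
  }
}

train_config: {
  # assumed field names; smaller queues hold fewer decoded images in RAM
  batch_queue_capacity: 2
  prefetch_queue_capacity: 2
  # ... rest of the train config unchanged ...
}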

Jeremy Lewi