I am currently using TensorFlow-Slim to train a model from scratch. If I just follow the instructions at https://github.com/tensorflow/models/tree/master/slim#training-a-model-from-scratch, everything works.

However, I want to use multiple GPUs, so I set --num_clones=2 or --num_clones=4. Neither works: in both cases training gets stuck at global_step/sec: 0 and never progresses. You can see the output in the linked image (error result).

DATASET_DIR=/tmp/imagenet
TRAIN_DIR=/tmp/train_logs
python train_image_classifier.py \
--num_clones=4 \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v3

I hope someone can help; thanks in advance. I am using TensorFlow 1.1 and Python 3.5 on Ubuntu 16.04. If you need more information, please let me know.

happenzZ
  • It takes longer to build the graph when using multiple GPUs. If you wait long enough, do you still see the problem? If you kill the program with ctrl+c, what is the stack trace? – Alexandre Passos Jun 27 '17 at 20:06
  • Hi, thanks for your reply. I ran the program again and waited for more than half an hour. It still gets stuck at the beginning. I also can't kill the program with ctrl+c. I can only suspend it with ctrl+z, but the processes still occupy the resources, so I have to release them with kill -9 PID. I have also updated the error image; see 'error result' above. – happenzZ Jun 28 '17 at 08:07

1 Answer

Your issue resembles an experience I had after switching from a single-GPU to a multi-GPU configuration using tf-slim. I observed that the parameter server job assumed the name 'localhost', which conflicted with the default job name assigned by model_deploy to my CPU device. I suggest you inspect the device names by following the "Logging Device placement" section of this tensorflow.org article. It explains how to print device names to the console on a per-operation basis. You can then pass the actual job name as an argument to DeployConfig()'s ps_job_name parameter and proceed with training.
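
The inspection step described above might look roughly like the following sketch, written against the TensorFlow 1.x API. The helper names (job_name_of, log_device_placement_once, make_deploy_config) are hypothetical and only for illustration. The sketch also assumes that the class referred to as DeployConfig() is model_deploy.DeploymentConfig from the slim deployment package, and that train_image_classifier.py would have to be edited by hand to pass ps_job_name, since the command in the question has no flag for it.

```python
import re


def job_name_of(device_string):
    """Return the job name embedded in a TensorFlow device string, or None."""
    match = re.search(r"/job:([^/]+)", device_string)
    return match.group(1) if match else None


def log_device_placement_once():
    """Run a tiny graph with device placement logging enabled (TF 1.x)."""
    # Imported lazily so job_name_of() can be used without TensorFlow installed.
    import tensorflow as tf

    a = tf.constant([1.0, 2.0], name="a")
    b = tf.constant([3.0, 4.0], name="b")
    total = tf.add(a, b, name="total")
    # log_device_placement=True prints the device chosen for each op.
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as sess:
        return sess.run(total)


def make_deploy_config(num_clones, ps_job_name):
    """Build a slim deployment config with an explicit parameter-server job name."""
    # Assumption: run from the slim directory of tensorflow/models, where the
    # deployment package lives, and the class is named DeploymentConfig.
    from deployment import model_deploy

    return model_deploy.DeploymentConfig(
        num_clones=num_clones, ps_job_name=ps_job_name)
```

If the logged device strings start with something like /job:localhost/..., then job_name_of() returns 'localhost', and that value is what would be passed as ps_job_name, for example make_deploy_config(4, 'localhost'). Whether this resolves the hang in the question is not confirmed; it is only the check suggested in this answer.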

foabodo
  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/low-quality-posts/17935656) – Nabin Nov 14 '17 at 03:26
  • Thanks for this explanation. I should be able to edit my answer so that it doesn't solicit feedback from the questioner. – foabodo Nov 14 '17 at 04:32