tensorflow slim multi-GPU can't work

Question

Currently I use tensorflow slim to train the model from scratch. If I just follow the instructions here https://github.com/tensorflow/models/tree/master/slim#training-a-model-from-scratch, everything is OK.

However, I want to use multiple GPUs, so I set --num_clones=2 or 4. Neither setting works: in both cases training gets stuck at global_step/sec: 0 and never continues. The output is shown in the 'error result' image.

DATASET_DIR=/tmp/imagenet
TRAIN_DIR=/tmp/train_logs
python train_image_classifier.py \
--num_clones=4 \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v3

Hope someone can help me, thanks in advance. By the way, I use tensorflow 1.1 and Python 3.5 on Ubuntu 16.04. If you need more information, please let me know.

It takes longer to build the graph when using multiple GPUs. If you wait long enough, do you see a problem? If you kill the program with ctrl+c, what is the stack trace? – Alexandre Passos, Jun 27 '17 at 20:06
Hi, thanks for your reply. I tried the program again and waited for more than half an hour. It still gets stuck at the beginning. However, I can't kill the program with ctrl+c. I can only stop it by pressing ctrl+z, but the processes still occupy the resources, so I need to release them with kill -9 PID. I have also updated the error image; you can refer to the 'error result' above. – happenzZ, Jun 28 '17 at 08:07

foabodo · Answer 1 · 2017-11-14T04:44:32.350

0

Your issue resembles an experience I had after switching from a single-GPU to a multi-GPU configuration using tf-slim. I observed that the parameter server job assumed the name 'localhost', which conflicted with the default job name assigned by model_deploy to my CPU device. I suggest you inspect the device names by following the "Logging Device placement" section of this tensorflow.org article. It explains how to print device names to the console on a per-operation basis. You can then pass the actual job name as an argument to DeployConfig()'s ps_job_name parameter and proceed with training.
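As a rough sketch of the inspection step, the helper below pulls the job name out of fully qualified device names of the kind device-placement logging prints. The function name job_names and the sample device strings are my own illustration, not output from the asker's run, and the regular expression only assumes the usual "/job:<name>/..." prefix.

```python
import re

# Matches the job component of a fully qualified TensorFlow device name,
# e.g. "/job:localhost/replica:0/task:0/device:GPU:0".
_JOB_RE = re.compile(r"/job:([^/]+)")


def job_names(device_names):
    """Return the distinct job names found in device strings, in first-seen order."""
    seen = []
    for name in device_names:
        match = _JOB_RE.search(name)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Illustrative input only (hypothetical device strings, not captured logs).
devices = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:1",
]
print(job_names(devices))
```

If the only job reported is something other than 'ps', that name is the candidate to pass as ps_job_name when train_image_classifier.py builds its DeployConfig. Whether this resolves the hang in your setup is an assumption on my part; it is simply what worked in my case.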

edited Nov 14 '17 at 04:44

answered Nov 14 '17 at 02:40


This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/low-quality-posts/17935656) – Nabin Nov 14 '17 at 03:26
Thanks for this explanation. I should be able to edit my answer so that it doesn't solicit feedback from the questioner. – foabodo Nov 14 '17 at 04:32
