
I turned on XLA in tf.slim with multi-GPU (two Titan Xp) as shown below, by editing train_image_classifier.py:

   # Enable XLA JIT compilation globally via the session config.
   jit_config = tf.ConfigProto()
   jit_level = tf.OptimizerOptions.ON_1
   jit_config.graph_options.optimizer_options.global_jit_level = jit_level

   ###########################
   # Kicks off the training. #
   ###########################

   slim.learning.train(
       ....  # (same as in the original script)
       sync_optimizer=optimizer if FLAGS.sync_replicas else None,
       session_config=jit_config)
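
For reference, here is a minimal standalone sketch of the same global JIT setting on a plain tf.Session, just to show the flag in isolation (the conv op below is only a placeholder I made up, not part of the slim script):

   # Minimal standalone sketch of the same global JIT setting on a plain
   # tf.Session (illustration only; the conv op is a placeholder, not taken
   # from the training script).
   import tensorflow as tf

   config = tf.ConfigProto()
   config.graph_options.optimizer_options.global_jit_level = (
       tf.OptimizerOptions.ON_1)

   images = tf.random_normal([64, 299, 299, 3])
   conv = tf.layers.conv2d(images, 32, 3)  # a conv op that XLA can cluster

   with tf.Session(config=config) as sess:
       sess.run(tf.global_variables_initializer())
       sess.run(conv)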

Then I ran the following command:

>> python train_image_classifier.py \
   --train_dir=/tmp/imagenet_train --dataset_name=imagenet \
   --dataset_split_name=train --dataset_dir=$DATA_DIR \
   --model_name=inception_v3 --max_number_of_steps=5000 \
   --batch_size=64 --num_clones=2
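
For what it's worth, to double-check that both clones really land on separate GPUs, device placement logging can be turned on in the same session config (log_device_placement is the standard ConfigProto field, nothing XLA-specific):

   # Sketch: turn on device placement logging in the same ConfigProto, so the
   # startup log shows which ops end up on /gpu:0 vs /gpu:1 for the two clones.
   jit_config = tf.ConfigProto(log_device_placement=True)
   jit_config.graph_options.optimizer_options.global_jit_level = (
       tf.OptimizerOptions.ON_1)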

However, I got these error messages.

2017-07-31 18:34:05.231408: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_1_bfc) ran out of memory trying to allocate 146.34MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-07-31 18:34:05.246182: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_1_bfc) ran out of memory trying to allocate 1.48GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
......
2017-07-31 18:34:06.311713: E tensorflow/stream_executor/cuda/cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311761: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0xd25d400: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311790: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0xd25d400: CUDA_ERROR_ILLEGAL_ADDRESS
2017-07-31 18:34:06.311855: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311869: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311942: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311959: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Invalid argument: input event cannot be null
2017-07-31 18:34:06.311973: E tensorflow/compiler/xla/service/gpu/convolution_thunk.cc:325] No convolution algorithm works with profiling. Fall back to the default algorithm.
2017-07-31 18:34:06.311983: E tensorflow/compiler/xla/service/gpu/convolution_thunk.cc:334] No convolution algorithm without scratch works with profiling. Fall back to the default algorithm.
2017-07-31 18:34:06.312037: F tensorflow/stream_executor/cuda/cuda_dnn.cc:2877] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
Aborted (core dumped)

Previously, XLA JIT was working well with the flags --batch_size=32 --num_clones=1. I suspect there is a bug in XLA's buffer allocation. Can anyone tell me whether I did something wrong?
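
In case it is related to the allocator warnings above, one experiment I could try is letting the GPU allocator grow on demand instead of reserving memory up front. This is only a guess on my part; allow_growth is the standard ConfigProto option, and I have not verified whether it changes the crash:

   # Guess / experiment: let the BFC allocator grow on demand instead of
   # reserving most of GPU memory up front (unverified whether this interacts
   # with XLA's buffer allocation at all).
   jit_config = tf.ConfigProto()
   jit_config.gpu_options.allow_growth = True
   jit_config.graph_options.optimizer_options.global_jit_level = (
       tf.OptimizerOptions.ON_1)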

Sangjun
