
I am trying to modify the Mask R-CNN code to run on multiple GPUs, based on the CIFAR-10 multi-GPU sample. Most of the relevant code is below.

One image and its ground-truth information are read from a TFRecords file as below:

    image, ih, iw, gt_boxes, gt_masks, num_instances, img_id = \
        datasets.get_dataset(FLAGS.dataset_name,
                            FLAGS.dataset_split_name,
                            FLAGS.dataset_dir,
                            FLAGS.im_batch,
                            is_training=True)

Here the image size and num_instances differ from image to image. These inputs are then stored in a RandomShuffleQueue as below:

    data_queue = tf.RandomShuffleQueue(capacity=32, min_after_dequeue=16,
            dtypes=(
                image.dtype, ih.dtype, iw.dtype,
                gt_boxes.dtype, gt_masks.dtype,
                num_instances.dtype, img_id.dtype))

    enqueue_op = data_queue.enqueue((image, ih, iw, gt_boxes, gt_masks, num_instances, img_id))
    data_queue_runner = tf.train.QueueRunner(data_queue, [enqueue_op] * 4)
    tf.add_to_collection(tf.GraphKeys.QUEUE_RUNNERS, data_queue_runner)
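
For context, the registered queue runners have to be started in the session once the full graph is built; a minimal sketch of the standard TF 1.x pattern (here `train_op` and `max_steps` are placeholders for the actual training loop):

    import tensorflow as tf

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        coord = tf.train.Coordinator()
        # start_queue_runners launches every runner registered in
        # GraphKeys.QUEUE_RUNNERS, including data_queue_runner above, so the
        # towers' dequeue() calls have data to consume.
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        try:
            for step in range(max_steps):  # max_steps: placeholder
                sess.run(train_op)         # train_op: placeholder built from the averaged gradients
        finally:
            coord.request_stop()
            coord.join(threads)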

Then I use a tower_grads list to collect the gradients computed on each GPU and average them; below is the multi-GPU part of the code:

    tower_grads = []
    num_gpus = 2
    with tf.variable_scope(tf.get_variable_scope()):
        for i in xrange(num_gpus):
            with tf.device('/gpu:%d' % i):
                with tf.name_scope('tower_%d' % i) as scope:

                    (image, ih, iw, gt_boxes, gt_masks, num_instances, img_id) =  data_queue.dequeue()
                    im_shape = tf.shape(image)
                    image = tf.reshape(image, (im_shape[0], im_shape[1], im_shape[2], 3))

                    total_loss = compute_loss() # use tensor from dequeue operation to compute loss

                    grads = compute_grads(total_loss)
                    tower_grads.append(grads)

    grads = average_grads(tower_grads)
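
For reference, `average_grads` is assumed to follow the `average_gradients()` helper from the CIFAR-10 multi-GPU example; a minimal sketch:

    import tensorflow as tf

    def average_grads(tower_grads):
        """Average the (gradient, variable) pairs collected from each tower."""
        averaged = []
        # Each element of tower_grads is one tower's list of (grad, var) pairs,
        # all in the same variable order, so zip(*...) groups them per variable.
        for grad_and_vars in zip(*tower_grads):
            # Stack this variable's gradients from all towers and average them.
            grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
            grad = tf.reduce_mean(tf.concat(grads, axis=0), 0)
            # The variables are shared across towers, so the first tower's copy suffices.
            averaged.append((grad, grad_and_vars[0][1]))
        return averaged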

When num_gpus=1, the code works well (I mean there is no error), but when I use two TITAN X GPUs, I get strange errors such as:

  • failed to enqueue async memset operation: CUDA_ERROR_INVALID_HANDLE
  • Internal: Blas GEMM launch failed

and the errors are not the same each time I run the code. I can't figure out why these errors only occur with multiple GPUs. Is there some conflict on the data queue or between the GPUs?
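
In case it is relevant: the CIFAR-10 multi-GPU example creates its session with `allow_soft_placement=True`, since some ops (e.g. the queue ops) have no GPU kernel, so a session config along these lines is assumed here; `log_device_placement` can also help spot placement conflicts:

    import tensorflow as tf

    # allow_soft_placement lets ops without a GPU kernel fall back to the CPU
    # instead of raising a placement error; log_device_placement prints where
    # every op actually runs.
    config = tf.ConfigProto(allow_soft_placement=True,
                            log_device_placement=False)
    # Grab GPU memory on demand instead of reserving all of it up front.
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)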

D. Tony
  • FWIW it doesn't sound like this has anything to do with queues in the TensorFlow input pipeline sense. The "queue" referenced in the error message is the StreamExecutor for GPU operations. Wild guess: is there enough power available for both GPUs? – Allen Lavoie Jun 27 '17 at 18:00
  • @AllenLavoie, thanks for reminding me that the error is related to GPU operations. You guess that the error may come from insufficient GPU memory, so I want to know: if the memory for one GPU is okay, why is the memory not sufficient when I use two GPUs? – D. Tony Jun 28 '17 at 13:46
  • I mean power as in the physical quantity (watts). If both of them work individually but there are problems when they're both working, it's possible that the PSU is being maxed out (note that they only draw significant amounts of power when under load). Looking at `watch nvidia-smi` might help. – Allen Lavoie Jun 29 '17 at 00:45
